Abstract
Reading comprehension tests are often assumed to measure the same, or at least similar, constructs. Yet, reading is not a single but a multidimensional form of processing, which means that variations in terms of reading material and item design may emphasize one aspect of the construct at the cost of another. The educational systems in Denmark, Norway, and Sweden share a number of traits, and in the recent decade, the development of national test instruments, especially for reading, has been highly influenced by international surveys of student achievement. In this study, national tests of L1 reading comprehension in secondary school in the three Scandinavian countries are compared in order to reveal the present range of diversity/commonality within the three test domains. The analysis employs both qualitative and quantitative aspects of data, including frameworks, text samples, task samples, and scoring guidelines from 2011 to 2014. Findings indicate that the three tests differ substantially from each other, not only in terms of the intentional and operative constructs of reading to be measured, but also in terms of testing methods and stability over time. Implications for the future development of reading comprehension assessment are discussed.
There is a growing awareness that changes in the way we assess educational progress have major implications for the development of teaching and learning. Research points to the fact that standardized tests such as national tests should be seen as part of the curriculum rather than external means of evaluating curriculum effects in terms of student learning (Forsberg & Lundahl, 2010). This is mainly owing to washback effects from high-stakes tests on policies and practices in school. In order to strengthen educational equity and increase assessment objectivity, standardized testing has for a number of years been, and is still being, intensified in the Scandinavian educational systems, as well as in many other countries (Dobson, Eggen, & Smith, 2009; Egelund, 2008; Wikström, 2009). This is well known. What different tests actually measure, however, is less known.
In part, the development towards a growing system of standardized testing is propelled by international comparisons, which influence our perception of national educational quality to a larger extent than before. Also, we are facing a change of beliefs in terms of what function large-scale evaluation should have for school and municipal economies, as well as for teachers and students in the classroom (Koretz, 2008; Stobart, 2008). Clearly, educational testing comprises a number of concerns, and in order to evaluate the range of inferences made possible by test results, we need profound insight into the rationales and theoretical underpinnings of prevalent test instruments. In short, we need to know what the tests are measuring, and what they are not measuring.
This article discusses the construction of national L1 reading tests in the Scandinavian countries. Based on essential tenets of current reading assessment research, the study reported is a comparison of the instruments designed to evaluate reading comprehension in secondary school in Denmark, Norway, and Sweden. Besides providing the general picture of purpose, timing, and structure of test instruments, the analysis aims at determining
(1) in what way test domains and construct definitions vary between the national reading frameworks in Denmark, Norway, and Sweden, and
(2) in what way reading material and item design vary between national reading tests in the three countries.
International comparisons influencing national standards
Although international surveys like PIRLS and PISA cannot identify cause-and-effect relationships between inputs, processes, and educational outcomes, the highlighting of system similarities and differences and of variance in educational progress has had an immense influence on educational policy making in several countries (Breakspear, 2012). In both Denmark and Norway, results in international achievement tests sparked intense debates about what was perceived as a national reading crisis, which eventually led to initiatives to change both national curricula and the cultures of testing and evaluation. In Denmark, following the initial publications of PISA findings, a series of national tests were introduced in compulsory school (Egelund, 2008) and an extensive program for educating reading coaches was introduced by reference to the official ambition that Denmark should be among the top five countries on the international PIRLS and PISA ranking lists (Danish Government, 2010). In Norway, the reading process model used to define test items in PISA was incorporated in the new national curriculum (LK 06). On the argument that a comprehensive theoretical framework of reading literacy like the one used by PISA was more adequate than any national alternative available, this process model provided the foundation for both the curriculum and the development of national tests in reading (Kulbrandstad, 2010; Roe & Lie, 2009). In Sweden, a curriculum reform in 2011 was clearly influenced by data from PIRLS 2006, showing that Swedish teachers spent less time on average than teachers from participating countries on providing explicit reading comprehension instruction (Swedish National Agency for Education, 2011a). Although construct definitions of reading comprehension in the Swedish curriculum remained unaffected by PIRLS, the development of national reading tests in grades 6 and 9 took the reading process model from PIRLS as a basis for defining both test items and scoring rules.
This shift was not mandated by the Swedish National Agency for Education (henceforth Swedish National Agency). Rather, it was an informal adaptation to international standards set by the test developers themselves; whereas in Norway, the adjustment to the PISA study reflected a clear official request from the Government (Kulbrandstad, 2010). Generally, then, educational testing in Scandinavia seems to have been highly influenced by the international testing trends, and perhaps this is particularly true for the reading tests. It is therefore important to provide essential knowledge about the kind of reading inventory represented by the current tests in the three countries.
Assessment of reading comprehension
A multidimensional construct
Assessment of “reading” or “reading comprehension” is a critical component in all forms of national and international evaluation of educational quality. Yet, conceptualizing reading comprehension presents a particular challenge and it is often considered to be a multidimensional construct (Duke, 2005; van den Broek, 2012). Both contextual factors (text type, topic, and purpose) and the balancing of different subskills (retrieval of explicit information, interpreting content, or critical evaluation) are assumed to account for variance in comprehension measures (Best, Ozuru, Floyd, & McNamara, 2006; Keenan, Betjeman, & Olson, 2008; Leslie & Caldwell, 2009; Nation & Snowling, 1997; Snow, 2002).
Much research has been devoted to identifying the factors that best explain variance in comprehension, as well as exploring the intercorrelations among items and subdivisions of the construct (Cutting & Scarborough, 2006; Davis, 1968; Keenan et al., 2008; Spearritt, 1972). A growing body of research also indicates that different tests tend to measure different aspects of reading comprehension (Cutting & Scarborough, 2006; Francis, Fletcher, Catts, & Tomblin, 2005; Francis, Snow, August, Carlson, Miller, & Iglesias, 2006; Keenan et al., 2008). As noted both by Nation and Snowling (1997) and later by Cutting and Scarborough (2006), these differences clearly influence the range of possible inferences to be made about students’ proficiency in reading, especially when it comes to identifying children with reading difficulties. Francis et al. (2005) argue that a promising route in future reading comprehension assessment research would be to do comparative studies of existing frameworks for assessment (“item properties, dimensionality, and generalizability”, p. 391). They also stress the need for deeper analyses of the very components of reading (processes, cognitive strategies, etc.) with which the existing assessment devices are actually dealing.
Item format differences
A number of studies examine the impact of item format and the range of achievement variation related to changes in testing framework and methodology. Some of these studies have indicated that the lion’s share of the construct may be targeted equally well by multiple-choice (MC) and constructed response (CR) formats, but that there are aspects of, for instance, literary reading for which MC items appear less appropriate (Campbell, 2005; Pearson & Hamm, 2005). Haladyna and Rodriguez (2013) comment that although there are some persistent beliefs about item differences, current research does not provide any clear answer as to whether MC and CR items tap the same or different cognitive functions in test-takers. For example, van den Bergh (1990) found that it was impossible “to demonstrate a substantial difference in intellectual abilities measured with either open-ended items or multiple-choice items for reading comprehension” (p. 9). Similarly, Farr, Pritchard, and Smitten (1990) argued that although the strategies used in answering MC questions may correspond less to non-test situation reading, they closely resemble the type of reading common in school, in which passages are often skimmed in search of particular information. However, item process validity should ultimately be considered in relation to the particular skill being tested. Pearson and Hamm (2005) report evidence that item format differences may emerge when tasks call for a deeper cognitive engagement, such as expecting test-takers to consider several stances towards a text (multiplicity), or to link ideas across different texts (intertextuality).
Other alternatives to standard MC items are Cloze and gap-filling techniques. In gap-filling tasks, some words are deleted from the text and the test-taker is expected to fill in the gap with the appropriate word from a list of alternatives. It has been questioned whether these techniques really measure higher-order reading comprehension or if they rely too heavily on micro language processing (Kintsch & Yarbrough, 1982). The cognitive target of an individual item obviously depends on which word is deleted, but the ability to solve the task is not necessarily constrained by global comprehension, but rather by local, lexical access and syntactic knowledge (cf. Alderson, 2000; Nation & Snowling, 1997). Still, the technique has been proven to discriminate well between skilled and less skilled readers (Yamashita, 2003).
Cognitive processes involved in reading comprehension
As for the particular components of reading comprehension, Keenan et al. (2008) comment that although we are often well informed about the practicalities of a particular test (e.g., passage length, item format, time allotment, and reliability measures), there is often less information offered about the cognitive processes measured by various tests. This probably reflects the remaining indistinctiveness that characterizes existing theoretical frameworks of reading comprehension. Highly influenced by Bloom’s (1956) hierarchical taxonomy of cognitive domains of knowledge, many standardized tests of reading categorize comprehension components by a model of cognitive complexity (from locating explicitly mentioned pieces of information, to inferencing, interpreting, and critically evaluating text) (Alderson, 2000; Khalifa & Weir, 2009; Pearson & Hamm, 2005). The threefold divide of cognitive targets, or mental processes, used in PISA is an example of this,[1] which has influenced not only the Scandinavian reading tests, but also the American National Assessment of Educational Progress (NAEP). The process models are used for developing test items and sometimes for developing subscales by which results are reported. It is not clear, however, that these categories are conceptually and psychometrically distinct from each other (cf. Henning, 1992). Empirical studies have demonstrated that classification of items by levels of cognitive processing is difficult even for trained reading researchers (DeStefano, Pearson, & Afflerbach, 1997) and university teachers (Alderson & Lukmani, 1989). Additionally, in validity studies of reading tests such as the SALT-reading (SAlsa Literacy Test) and PISA, the assumed sets of subskills underlying the construction were proven empirically indiscriminable because of high intercorrelations (Meijer & van Gelderen, 2002; van Steensel, Oostdam, & van Gelderen, 2012).
Thus, even if a division by cognitive targets may be useful in order to ensure domain coverage, it is uncertain whether test scores may be reliably reported by subscales of cognitive process levels.
Method and material
Comparisons in the present study are based on frameworks, text samples, task samples, and scoring guidelines from national reading tests in Norway and Sweden and from final examinations of reading in Denmark. Data include tests in use from 2011 to 2014. These tests are all designed anew for each test administration and use no anchoring of texts or item banks. Documentation concerning construct definitions and relationship between tests and national curricula was gathered chiefly from policy documents on national testing systems and administration published by the Danish Ministry of Education (henceforth Danish Ministry), the Swedish National Agency, and the Norwegian Directorate for Education and Training (henceforth Norwegian Directorate). Norwegian national reading tests are defined both in a general testing framework (Norwegian Directorate, 2010) and in a specific guideline for the reading tests (Norwegian Directorate, 2011). The Danish reading tests are described primarily in a guideline on final examinations in the subject of Danish (Danish Ministry, 2014). The most detailed information about content and construct definition of Swedish reading tests is found in the teacher guidelines accompanying the test material every year (e.g., Swedish National Agency, 2014). The comparison of construct definitions is concerned with three levels: (1) general description of reading ability; (2) definitions of reading processes (cognitive targets) to be tested; and (3) reading material. As for more specific information, such as guidelines for text sampling and item construction, no such data has been found to be available.
Categorization of texts used in the sample includes length (number of words), format (continuous, non-continuous, or mixed),[2] and text type (argumentation, description, exposition, narration, instruction, poetry,[3] and multiple texts). Text type categorizations such as these typically do not capture authentic texts in their complexity. Yet, in national reading tests it should be expected that text samples are representative of the various types of reading to be measured. Therefore, a categorization of text type, based on predominant characteristics, may still be informative. Level of difficulty would also be a critical variable had the tests been targeting students of either the same age or the same grade. In this case, the Norwegian students are two and a half years younger than the Swedish and Danish students, which is why difficulty level is hard to interpret comparatively between the various tests.
Tasks are categorized by item format and cognitive target. Items for testing reading comprehension can be constructed in a number of ways (Alderson, 2000), although MC questions are typical in many standardized tests (Campbell, 2005; Khalifa & Weir, 2009; Rowe, Ozuru, & McNamara, 2006). The comparison in this study distinguishes between five different item formats: standard multiple-choice (SMC); short-answer constructed response (SACR); open-ended constructed response (OECR); gap-filling multiple-choice (GFMC); and items that combine multiple-choice and short-answer or open-ended response (CMS/O). SACR are tasks that call for concise writing performances, usually a single word or a single sentence, and scoring is more or less objective from a list of pre-specified correct responses. OECR items enable a more extended response and although scoring is related to rubrics, it involves interpretation by the rater and thus a certain level of rater variability. Items were categorized as OECR when they enabled a response of three lines of text or more.
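The format categories described above can be read as a simple decision rule. The following is a minimal sketch of that rule, not the coding instrument used in the study: the `Item` fields and the function name `classify_format` are hypothetical, and the only threshold taken from the text is that a constructed response of three lines or more counts as OECR.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical description of a test item (field names are assumptions)."""
    multiple_choice: bool   # the item offers fixed alternatives
    gap_filling: bool       # the alternatives fill a gap in the text
    response_lines: int     # lines of written response the item enables (0 = none)


def classify_format(item: Item) -> str:
    """Return one of the five format labels used in the comparison."""
    if item.multiple_choice and item.response_lines > 0:
        # Fixed alternatives combined with a written part.
        return "CMS/O"
    if item.multiple_choice:
        return "GFMC" if item.gap_filling else "SMC"
    if item.response_lines <= 0:
        raise ValueError("an item needs alternatives or a written response")
    # Rule stated in the text: three lines of text or more means open-ended.
    return "OECR" if item.response_lines >= 3 else "SACR"


if __name__ == "__main__":
    print(classify_format(Item(multiple_choice=False, gap_filling=False, response_lines=1)))
    print(classify_format(Item(multiple_choice=False, gap_filling=False, response_lines=4)))
```

The sketch only captures the surface criteria; in practice the distinction between SACR and OECR also rests on how the scoring guideline treats the response.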
Cognitive target refers to the mental process or cognitive dimension underlying the reading comprehension required to solve a particular type of task (Alderson, 2000; Haladyna & Rodriguez, 2013; Khalifa & Weir, 2009). Since there is no common framework between the three tests by which tasks can be categorized, the cognitive target matrix used for the analysis was developed specifically to provide a balanced representation of the tasks included in the entire sample. It originates from the processes of reading comprehension defined in the PIRLS reading framework (IEA, 2009).[4] However, in order to provide a distinction between processes such as integrating information and inferencing on the one hand and global and thematic interpretations on the other, a five-level matrix was developed instead (see Table 1). The model also allows for a distinction between reflection upon textual structure as a text-based activity and reflection on global meaning that draws upon knowledge and experience, which is argued for by Frederking, Henschel, Meier, Roick, Stanat, and Dickhäuser (2012).
Table 1. Cognitive targets used for coding items.
All items (n = 465) were coded (by the author) according to the matrix in Table 1 in order to provide comparable data for the three tests. Coding included examination of each item in relation to the information provided in the text and in the scoring guideline, which sometimes offered additional information about the test constructor’s intention. Unfortunately, there was no second coder in this study to allow for an inter-rater reliability check. Instead, an intra-rater reliability check was conducted by recoding a random sample of 60 items (13%) in total (five from each year and country). Kappa statistics indicated a reasonable proportion of intra-rater consistency (κ = 0.78, p < .0001). However, since previous studies indicate how difficult it is even for reading experts to agree on the cognitive targets of test items (Alderson & Lukmani, 1989; DeStefano et al., 1997), the coding may still suffer from imprecision. The likely effects of this are discussed in the final section of the article. It should be noted that this dimension of uncertainty is confined to the coding of cognitive targets, as classification of other variables such as text length or item format offers little room for subjectivity.
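For readers less familiar with the kappa statistic, the sketch below shows how an intra-rater check of this kind can be computed. It assumes that the statistic is Cohen's kappa, which the text does not state explicitly, and the two code lists are invented for illustration only; they are not the study's data. The function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(first_pass, second_pass):
    """Cohen's kappa for two codings of the same items (assumed statistic)."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("both codings must cover the same non-empty item set")
    n = len(first_pass)
    # Observed agreement: share of items given the same code both times.
    observed = sum(a == b for a, b in zip(first_pass, second_pass)) / n
    # Chance agreement: product of the marginal proportions, summed over codes.
    c1, c2 = Counter(first_pass), Counter(second_pass)
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # only one code ever used; agreement is trivially perfect
    return (observed - expected) / (1 - expected)


# Sample size reported in the text: five items per year (2011-2014) and country.
sample_size = 5 * 4 * 3
sample_share = round(100 * sample_size / 465)

# Invented example: ten items coded twice on the five-level matrix.
original = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
recoded = [1, 1, 2, 2, 3, 3, 4, 4, 5, 4]

if __name__ == "__main__":
    print(sample_size, sample_share)
    print(round(cohens_kappa(original, recoded), 3))
```

In the invented example nine of ten codes agree (observed agreement 0.9) and chance agreement is 0.2, so kappa is (0.9 - 0.2) / (1 - 0.2) = 0.875.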
Purpose and timing
The tests included in the current study all aim at measuring student achievement according to curriculum goals. However, in Norway, reading is defined as a basic skill that spans across the entire curriculum, whereas in Sweden and Denmark, the reading tests are defined by subject standards in Swedish and Danish respectively. Although a common label such as “national test” may imply common objectives of testing, the purpose and use of these three tests are somewhat different. Danish and Swedish reading tests included in the study are part of a summative assessment to measure students’ level of achievement at the end of compulsory school. These tests are administered during the spring term[5] in 9th grade when students are 15–16 years old.[6] Swedish national tests serve two primary purposes: to support reliability of subject grading across the country and to evaluate educational quality at various levels (Swedish National Agency, 2014). Danish tests are more decisively focused on measuring student achievement according to subject standards in Danish and are administered after subject grading (Danish Ministry, 2014). In contrast, Norwegian national tests are administered in the autumn term of 8th grade, when students are 13–14 years old, for the primary purpose of evaluating educational quality comparatively at the level of schools and municipalities (Roe & Lie, 2009).[7] The timing, however, also allows formative use of test results by providing teachers with detailed information about students’ proficiency in order to guide future teaching and learning in the classroom.
Results
Construct definitions
Danish reading tests in 9th grade start out from subject standards in Danish (Danish Ministry, 2009) and assessment concerns students’ “reading abilities (reading comprehension and reading speed)” (Danish Ministry, 2014, p. 12). Students are furthermore expected to be able to “use different reading forms, reading techniques and reading comprehension strategies in order to solve the tasks quickly and purposefully”. The tasks require reading techniques such as “skimming text,” “search reading,” and “close reading.”
Norwegian reading tests in grades 8 and 9 assess reading ability as a basic skill integrated in all subjects. The definition of reading stresses the “ability to read, comprehend and use texts that students may come across in school or in other domains of life, and to be able to take an independent and reflective stance toward form and content of the texts” (Norwegian Directorate, 2011, p. 3). The reading ability to be assessed is described as something more than decoding; rather, it is “a requirement for participation in society, both in future education, in working life and in private life” (2011, p. 3).
In agreement with subject standards in Swedish (Swedish National Agency, 2011b), Swedish reading tests in 9th grade aim at assessing students’ ability to “read and analyze literature and other texts for different purposes” (Swedish National Agency, 2014, p. 20). A reader needs to master “different reading strategies for comprehending, interpreting and analyzing texts depending on context and purpose of reading” (2014, p. 20).
From general descriptions such as these, the test domains to be compared may appear quite similar, although some discrepancies are detectable already. The construct definitions also include schemes of reading processes to be tested (see Table 2). In Norway and Sweden, the process schemes are used for defining the cognitive target of each test item. In Denmark, the processes serve rather as a detailed account of the challenges included in many test items, and not as a scheme for organizing each item.
Table 2. Reading processes defined by testing frameworks in Denmark, Norway, and Sweden.
Source: Danish Ministry, 2014, p. 12; Norwegian Directorate, 2012, p. 10; Swedish National Agency, 2014, p. 20.
To define test domains and items by way of established reading process schemes may seem rational from a test construction point of view, but it also entails that these processes play a central part in the curriculum goals. In Norway, this is clearly the case; the curriculum definition of reading as a basic skill in and across all subjects begins with the three reading processes mentioned in Table 2 (Norwegian Directorate, 2012). The process scheme is drawn directly from the PISA reading framework (OECD, 2009).
In Sweden, on the other hand, the process scheme used for test construction – which is copied directly from the reading process scheme used in PIRLS (IEA, 2009) – is not mentioned at all in the subject standards. The attempt to provide comprehensive ground for assessment may thus have negative consequences for the alignment of assessment to subject standards. For example, while the test domain emphasizes aspects such as retrieval of information and straightforward inferences, the subject standards emphasize the ability to discern authorial meaning, global theme, and motif (Swedish National Agency, 2011b), which are objectives not targeted in the national tests. Therefore, the claim that Swedish national test scores measure the level of students’ reading ability according to curriculum standards is an inference that requires a specific and more extensive validity argument (Kane, 2013). To date, no such validity argument has been provided by the test constructors.
As for the Danish definition of reading processes, it is rather well established in subject standards for Danish (Danish Ministry, 2009), in that the same reading techniques and connections to general knowledge and knowledge about language are stressed repeatedly both in the examination guideline and in the curriculum. However, since test items are not defined explicitly by the reading process scheme, the specific relationship between subject standards and test domain requires a more comprehensive analysis.
Reading material and composition
In this section, quantitative and qualitative aspects of the reading material included in the three tests are presented. Table 3 reports on text types and text formats. Table 4 presents the amount of text to be read, the amount of literary text, and the time provided in each test design for reading and responding. Table 5 displays the distribution of items over text types included. The three tests have a general structure in common, in that they are all based on a set (n = 5–8) of longer texts[8] to be read and a set (n = 1–10) of items related to each text. As evident in Tables 3 and 4, types, formats, and amounts of text included vary to some degree, whereas the amount of time provided for reading and responding varies considerably.
Table 3. Text types and text formats.
Note: C = continuous text; NC = non-continuous text; Mix = mixed text.
Table 4. Amount of reading and time provided for reading and responding.
Table 5. Distribution of items over text types.
Note: A few items in the Swedish tests relate to several texts of different types, which is why “multiple texts” is included as a category.
Descriptive and narrative texts evidently play an essential part in the testing of reading comprehension in the Scandinavian countries (Table 3). Swedish and Norwegian tests also include argumentative texts, whereas Danish tests do not. Danish and Norwegian tests include expositions (explanations and explications of concepts or constructs) from subject areas such as biology, physics, and history, or as columns, none of which are included in the Swedish tests. Swedish tests, on the other hand, include poetry, which Danish and Norwegian tests do not. The total amount of literary text (narration and poetry) included (Table 4) and the number of items related to literary text (Table 5) are considerably higher in Sweden and Denmark than in Norway.[9] Narrative text is normally represented by a short story or an extract from a novel. Although Swedish and Norwegian tests include only contemporary stories, Danish tests systematically include one fairy tale or folk tale (“fortælling”) and one contemporary story in each test.
From Table 5, we also learn that Danish tests are markedly different from Swedish and Norwegian tests, in that they put longitudinal and systematic focus on three text types only, whereas Swedish and Norwegian tests demonstrate a more variable emphasis over time in terms of text type sampling.[10] The Danish tests are also more systematic in the sense that each year the test includes five texts and 10 questions for each text. The number of items related to each text type in Norwegian and Swedish tests is not fixed in this way.
As for the amount of non-continuous text material, Norwegian and Danish tests generally include diagrams, tables, and/or illustrations in combination with continuous text, whereas in the Swedish material, the only texts coded as non-continuous are poetical texts.
The amount of time provided for reading and responding is strikingly different in the three tests (Table 4). In Denmark, considerably less time (30 min) is provided for completing the test than in Norway (90 min) and Sweden (200 min). This is probably related to the fact that reading speed and the ability to skim text in search of information are included in the Danish definition of reading ability. It should be noted that both Danish and Norwegian tests include larger proportions of MC items than the Swedish test (Table 6), which may allow for quicker responding. But taking into account (1) that Norwegian tests are administered to students who are two and a half years younger than Swedish students, (2) that the amount of text to be read is almost equal between the three tests, and (3) that there are twice as many items in the Norwegian test (albeit primarily MC items) as in the Swedish test, the time allocated for reading and responding in Sweden stands out as comparatively very extensive.
Table 6. Number of items in different response formats.
From a composition and reading material point of view, it should thus be noted that there are some distinct differences between what is being tested in the three Scandinavian national reading tests. Although the construct definitions do not account for dissimilarities related to textual content, the analysis of composition and reading material clearly demonstrates that operative definitions diverge substantially from each other. This also concerns the variable time constraints, as students most likely will have to adapt their strategies for comprehension to the time provided for reading and responding.
Response format
A comparison of response mode exposes additional differences (Table 6). The most apparent discrepancy is that while Danish tests are based entirely on multiple-choice items (80% SMC and 20% GFMC), Norwegian tests include about 25% constructed response items (mainly SACR) and Swedish tests about 75% constructed response items (roughly 60% SACR and 40% OECR). Moreover, the last section of the Danish tests is a column including 10 gap-filling items (GFMC). This technique is not used at all in Swedish and Norwegian tests. Although there is still some controversy as to whether MC and CR items are capable of measuring the same type of reading skills, the use of GFMC items indicates an emphasis on lexical and syntactic knowledge rather than global or higher-order comprehension (Weir, 2013).
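The Danish percentages follow directly from the fixed test design reported earlier (five texts with 10 items each, the last text carrying 10 gap-filling items). The short sketch below reproduces that arithmetic under the assumption that this design holds for every test year; the constant and function names are illustrative only.

```python
# Assumed fixed Danish design, taken from the text: 5 texts, 10 items per text,
# and a final section made up of 10 gap-filling multiple-choice (GFMC) items.
TEXTS_PER_TEST = 5
ITEMS_PER_TEXT = 10
GFMC_ITEMS = 10


def format_shares(counts):
    """Convert item counts per format into percentages of all items."""
    total = sum(counts.values())
    return {fmt: 100 * n / total for fmt, n in counts.items()}


total_items = TEXTS_PER_TEST * ITEMS_PER_TEXT
danish_counts = {"SMC": total_items - GFMC_ITEMS, "GFMC": GFMC_ITEMS}
danish_shares = format_shares(danish_counts)

if __name__ == "__main__":
    print(total_items, danish_shares)
```

With 50 items per test, 40 standard multiple-choice items and 10 gap-filling items give the 80% and 20% shares cited above.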
In Swedish tests, the proportion of MC items increases over the sampling period. There is nothing in the new curriculum from 2011 that motivates a shift of this kind. However, the national tests in Sweden, which are scored by the class teachers and not by external examiners, have been heavily criticized for low degrees of inter-rater reliability (Swedish Schools Inspectorate, 2013).[11] Thus, by including larger proportions of MC items, the scoring procedure is believed to become more objective. A similar argument is referred to by Norwegian test constructors for choosing a test design in which at least 70% of all items are MC (Roe & Lie, 2009).
Reading comprehension processes
What, then, is the intended domain of comprehension processes in these tests? By analyzing and coding each item individually with reference to the type of cognitive demand in focus, it is possible to compare the three tests in terms of cognitive complexity, or domain of reading ability to be measured. It should be noted that cognitive target in individual items depends just as much on the text (Rowe et al., 2006) and the scoring guideline (Solheim & Skaftun, 2009) as on the particular question formulation. These variables are thus taken into account in the judgment of cognitive target for each item. Table 7 shows the distribution of items over cognitive targets.
Table 7. Distribution of items over cognitive targets.
Table 7 reveals some country specific characteristics as to the relative distribution over cognitive targets. The Danish tests, for instance, seem to put distinctly greater emphasis on measuring basic reading skills (i.e., retrieving explicit information and making local, straightforward inferences) than Norwegian and Swedish tests do. Conversely, the Norwegian test and especially the Swedish test appear to be more focused on testing interpretive and reflective aspects of the reading ability. These differences in the test domain match the construct definitions reported above. While the Danish framework stresses the ability to use reading techniques, solve tasks quickly, and skim text, Norwegian and Swedish frameworks rather emphasize reflective and analytical processes of reading.
According to the data reported in Table 7, Norwegian and Swedish tests also give a significant role to the examination and recognition of language, a dimension which is almost non-existent in the Danish tests. These are items that typically ask the test-taker to recognize the function of a particular structural or linguistic element (e.g., the meaning of quotation marks for a specific word or the meaning of certain phrases or figures of speech).
A small share of items, observed in all three tests, targets the more abstract and global meaning-making process in reading. These are items that ask for thematic interpretations, or items that require the test-taker to use substantial world knowledge, for example, in order to make personal reflections upon the text or to explain aspects of a story or of a character’s intentions. While the Norwegian and the Swedish testing frameworks stipulate this dimension as a cognitive target, the Danish examination guideline does not. In spite of this, the most characteristic examples of this level of reading processing are found in the Danish tests, particularly related to the contemporary narrative text. Here is an example:
□ the ambition of parents
□ competition between friends
□ conflicts between teenagers
□ the relationship between teachers and parents
For a test-taker to arrive at the correct alternative on this task, he or she needs not only to retrieve and integrate ideas and information from the entire text (which is 1530 words long), but also to make the appropriate global inference about the kind of human conduct or affairs that the story topicalizes. And even though the distractors might not be very plausible in the actual case, the abstract articulation of all four alternatives will certainly cause problems for many test-takers.
Discussion
The study sets out to examine (1) in what way test domains and construct definitions vary between the national reading frameworks in Denmark, Norway, and Sweden, and (2) in what way reading material and item construction vary between national reading tests in the three countries. As demonstrated above, there are some obvious similarities between the three test designs in terms of reading material and basic structure, but the study also reveals a number of critical differences. These differences concern construct definitions (as defined by testing frameworks), test content (as defined by reading material and cognitive targets), and methods for measurement (as defined by the number, structure, and format of items). By mapping these differences, the study also indicates potential directions for the future development of reading comprehension assessment in the three countries.
As for construct and content, the testing of reading ability in Denmark appears more closely focused on techniques for quick and purposeful reading, and for students to be able to skim text and gather relevant information. This is indicated both by item construction and time constraint. In turn, Norwegian and Swedish tests are clearly more focused on analytic, interpretive, and reflective aspects of reading. Findings from the study also demonstrate that the structuring of items, both within test and over time, is more systematically organized in Danish reading tests than in Norwegian and Swedish tests. The selected types of text to be read and the number of items for each text are, for instance, more consistent over time in Denmark. On the other hand, items in the Danish test are not explicitly defined by expected cognitive targets, which makes differentiated profiles related to subskills of reading difficult to attain (cf. Pøhler & Sørensen, 2010).
The fact that the coding of items was conducted by a single coder represents a clear limitation to the study and needs to be considered when drawing conclusions about the different test designs. Small differences in the proportional distributions, either between countries or between categories of cognitive target, should be treated cautiously. Other findings (e.g., the relatively stronger focus in Denmark on focus-and-retrieve items or the relatively stronger focus in Sweden on integrate-and-interpret items) are so clear that the potential imprecision of coding is unlikely to change the overall picture. Moreover, disagreement introduced by a second coding would not be too harmful as long as the second coder applies the cognitive process matrix consistently (intra-reliably). It would potentially affect the total distribution of items between the categories, but it would be less likely to change the relative distributions of cognitive targets between countries. Therefore, the comparative component drawn from the single coding may still be relatively stable.
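As a minimal illustration of how a second coding could be checked against the first, the following Python sketch computes Cohen's kappa, a common chance-corrected agreement statistic, together with the proportional distribution of items over categories. It is not part of the study's procedure: the category labels and the two short label lists are hypothetical toy data, and the function names are introduced here only for illustration.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty list of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def proportions(labels):
    """Relative distribution of items over categories."""
    n = len(labels)
    return {category: count / n for category, count in Counter(labels).items()}


# Hypothetical toy codings of eight items (not data from the study).
first_coding = ["retrieve", "retrieve", "interpret", "interpret",
                "reflect", "retrieve", "interpret", "reflect"]
second_coding = ["retrieve", "retrieve", "interpret", "reflect",
                 "reflect", "retrieve", "interpret", "interpret"]

kappa = cohen_kappa(first_coding, second_coding)
print(f"kappa = {kappa:.3f}")
print(proportions(first_coding))
print(proportions(second_coding))
```

In this toy case the two coders disagree on two items, yet their overall distributions over categories are identical, which shows why agreement should be assessed item by item rather than inferred from matching proportions alone.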
Although the three tests are performed for slightly different purposes, the differences discussed in the study reveal areas of potential progress in the development of future reading tests. As for the Danish test, it should perhaps be investigated whether the strong focus on basic skills is really motivated by the curriculum, and whether this focus in itself really designates an appropriate construct of the reading ability expected at the age of 16. A relevant scheme of reading processes according to which the test domain and the sample of items could be organized seems to be a reasonable first step. It entails, of course, that the process scheme is validated with regard both to curriculum goals and psychometric standards (cf. Pearson & Hamm, 2005). However, while both Norwegian and Swedish test designs use predefined reading process schemes (from PISA and PIRLS respectively) to get a systematized measurement of reading ability, the variation of text type, length, and number of items included for each text type demonstrated over time indicates that what is measured one year is not necessarily the same thing as what is measured the next year. A similar risk of increasing the dependency on construct-irrelevant factors is related to the inclusion of fewer and longer texts, fewer test items, and the demand of longer written responses from test-takers. Thus, the Swedish test appears more exposed to such threats to validity than the reading tests in Denmark and Norway. This is a troublesome aspect of the Swedish tests that needs to be considered carefully in preparation for future test development.
In conclusion, the study offers a comparative insight into the qualities and quantities of reading comprehension assessment in the Scandinavian countries. For a more substantial understanding of the type of reading ability actually measured, further research is necessary, particularly about cognitive targets of items and their relationship to response format. In the test, the distribution of cognitive targets is defined by the test constructor. In this analysis, it is defined by a researcher’s coding procedure. Whether or not the given categories really represent what is going on in students’ empirical reading while taking the test remains to be examined, for example, by way of think-aloud protocols during actual test-taking. Moreover, as the framework of analysis in itself may be seen as an outcome of the study, student data would also be valuable for empirical validation of the model. Similarly, in order to support ecological validity of the tests and to validate “positive washback effects,” categories of subskills also need to be shared by teachers and external examiners. Such studies could be conducted by using task samples from the three test designs in order both to evaluate process validity and to compare the effects of item format.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
