Abstract
On the basis of a validation study of a new test for assessing low-achieving adolescents’ reading comprehension skills – the SALT-reading – we analyzed two issues relevant to the field of reading test development. Using the test results of 200 seventh graders, we examined the possibility of identifying reading comprehension subskills and the effects of task specificity on test reliability. Regarding the former, we distinguished three subskills indicating different levels of understanding (‘retrieving’, ‘interpreting’, ‘reflecting’). However, confirmatory factor analyses did not support the presence of these subskills. Task specificity refers to the situation that different tasks within a test are not uniformly difficult for individual test takers, which constitutes a form of error negatively influencing test reliability. However, Generalizability Theory analysis showed that such task-specific effects did not occur: the reliability of the SALT-reading was primarily affected by error associated with the score variance within tasks.
Introduction
This paper deals with two issues relevant to the field of reading test development: the possibility of identifying reading comprehension subskills and the effects of task specificity. These issues were examined in a validation study of a new test for assessing low-achieving adolescents’ reading comprehension skills, the SALT-reading. Part of this study was an analysis of the test’s ‘content representativeness’ (Messick, 1995), the extent to which a test is an adequate representation of the construct domain to be measured. Bachman (2002) argues that to ensure content representativeness a language test should be construct-based and task-based. The former implies that the test covers important aspects of the ability to be measured. The latter implies that the test constitutes a representative sample of the tasks relevant to the construct domain, across which the tester can generalize to provide an overall assessment of the test taker’s proficiency in this domain. These demands raise two empirical questions: can subskills of reading comprehension be measured by means of reading comprehension tests? And to what extent can generalizations be made across different reading tasks?
Reading comprehension subskills
Questions in reading comprehension tests are often classified on the basis of the location of the information to be obtained (local/global text level) and the explicitness of the match between the question and the information in the text, resulting in item categories such as identifying main idea, locating details, or making inferences (Cerdán, Vidal-Abarca, Martínez, Gilabert, & Gil, 2009; Davis, 1968; Goldman & Durán, 1988; OECD, 2003; Rosenshine, 1980; Rouet, Vidal-Abarca, Erboul, & Millogo, 2001; Song, 2008; Vidal-Abarca, Gilabert, & Rouet, 1998).
Theoretically, question answering is seen as a categorization mechanism that identifies the type of question and the relevant information sources (Cerdán et al., 2009; Goldman & Durán, 1988; Graesser, Gordon, & Brainerd, 1992; Mosenthal, 1996; Rouet et al., 2001; Vidal-Abarca et al., 1998). Identifying relevant information sources involves activating nodes in a knowledge network consisting of both textual information and background knowledge. Rouet et al. (2001) argue that the nature of a test item determines how this activation process evolves, regarding both the number of nodes involved and their accessibility. Test items can differ first of all because they focus either on a single concept or on a (hierarchical) set of concepts. In the former case, the number of knowledge nodes to be activated is limited, whereas the latter case requires the activation of a larger number of nodes. Test items can also differ in the match between the wording of the item and the relevant information in the text. If there is an explicit, literal match between question and text, the knowledge node necessary for answering the question can be accessed directly. If there is no explicit match, the node must be accessed indirectly, through the contextual activation of relevant knowledge structures.
Research shows that such differences between test items result in different reading behavior. In two experimental studies involving university and high school students, Rouet et al. (2001) and Vidal-Abarca et al. (1998) found that ‘low-level questions’ – focusing on a single concept, requiring only superficial text processing – often resulted in students using a locate-and-memorize strategy, browsing several paragraphs rather quickly and searching a small number of paragraphs per question. On the other hand, ‘high-level questions’ – focusing on a broader set of concepts or concepts higher in a hierarchy, requiring readers to generate more connections between knowledge and text information – often resulted in students using a review-and-integrate strategy, pausing on more paragraphs during their search for the right answer, searching a larger number of paragraphs per question, and showing more search iterations (i.e. sequences of question reading, text reading, and answer writing).
Many tests include the different item types described above, sometimes broadly categorized as lower, higher, and intermediate. Lower level items involve retrieving detailed information, located in specific text segments (OECD, 2003; Rosenshine, 1980; Rouet et al., 2001; Song, 2008; Vidal-Abarca et al., 1998). This information is either verbatim or requires a minor conversion, for instance, when the answer is a synonym of a word in the question (Cerdán et al., 2009; Davis, 1968; Goldman & Durán, 1988; OECD, 2003). In the terminology introduced earlier, these items involve single, directly accessible nodes in the knowledge network. Higher level items involve comprehending a text’s theme or main idea (OECD, 2003; Rosenshine, 1980; Song, 2008), the author’s general purpose, tone, or attitude (Davis, 1968; OECD, 2003), and the ability to draw conclusions from or make predictions on the basis of the text (Rosenshine, 1980). These items involve multiple nodes related on a global text level, which need to be integrated into central, superordinate nodes. Finally, an intermediate category can be distinguished. These items tap the comprehension of the underlying relationships between local pieces of information, such as cause-effect relationships between sentences (OECD, 2003; Rosenshine, 1980). They involve the integration of nodes that are not related on the global, but on the local level.
Divisibility of subskills
There is controversy about whether subskills such as those mentioned above are empirically distinguishable by tests (Alderson, 2000; Khalifa & Weir, 2009; Weir & Porter, 1994): while some researchers provide evidence for dimensionality, others question whether separate skills can actually be measured. In an early study, Davis (1968) established the presence of five skills on the basis of reading test data collected among 12th graders in academic high schools: memory for word meanings; inferencing; following passage structure; recognizing a writer’s purpose, attitude, tone, and mood; finding answers to questions asked explicitly or in paraphrase. Applying factor analysis to the same data set, Spearritt (1972) found support for the presence of the first four skills, although the factor intercorrelations were high (.75–.93). In another reanalysis Thorndike (1973–1974) found two factors: general reading comprehension and word knowledge. Schedl, Gordon, Carey, and Tang (1996) analyzed the factor structure of the reading comprehension items of the Test of English as a Foreign Language (TOEFL), but did not find evidence for the presence of the two postulated dimensions, that is, ‘reasoning’ (analogy, extrapolation, organization and logic, and author’s purpose/attitude) and ‘other’ (vocabulary, syntax, and explicitly stated information). Meijer and Van Gelderen (2002) analyzed the Dutch PISA data – collected among students in all levels of secondary education – and found that the three subskills assumed to underlie the PISA reading test could not be distinguished because of very high intercorrelations (around .95). Song (2008) conducted Structural Equation Modeling to analyze the results of a reading comprehension test for international (under)graduates and found evidence for the presence of two skills (although three were postulated): the ability to understand explicitly versus implicitly stated meaning.
Weir and Porter (1994) argue that divisibility is a function of the level of the readers being tested, claiming that for proficient readers, reading is unidimensional, while for less proficient readers it is not. Similarly, Alderson states: ‘once basic word recognition skills have been acquired, and children are able to understand connected text, the whole reading process becomes so very integrated that, although a variety of skills may be needed, they cannot be separately identified empirically’ (2000, p. 97). However, for beginning readers – including those who learn to read in a second language – it can be assumed that different abilities develop at a different rate, as a result of which proficiency in one skill does not necessarily mean proficiency in another. For low achievers – our target group – it could be assumed that they fall behind in specific kinds of skills but do better in others, making it similarly possible to empirically distinguish between these skills.
Task specificity
The demand that a language test is task-based implies that the test tasks are a representative sample of all tasks relevant to the language use domain of interest (Bachman, 2002). In reading comprehension tests, tasks usually consist of texts and questions. Together, test takers’ responses to these tasks are taken to provide a valid indication of their ability to understand texts from a particular domain. Texts naturally differ in topic/content, genre, format, and medium of presentation (Alderson, 2000). This variability has two likely consequences. First, it will lead to score differences between test takers: as a function of their familiarity with a topic, genre, format, and/or medium, some test takers will do better on a particular task than others. Second, it is expected that different configurations of these characteristics cause an individual test taker’s scores to vary from one task to the other; this is also referred to as ‘task specificity’ or ‘case specificity’ (Gagnon, Charlin, Lambert, Carrière, & Van der Vleuten, 2008; Linn, 1993; Linn, Baker, & Dunbar, 1991; Norman, Bordage, Page, & Keane, 2006).
The effects of task specificity were previously established by means of G(eneralizability) theory analysis in studies on performance-based writing and speaking assessments (Lee, 2006; Sawaki, 2007; Schoonen, 2005; Xi, 2007). Using ANOVA techniques, G theory analysis aims to identify the different sources of error affecting reliability. The general observation from these studies is that the different writing/speaking tasks within a test are not uniformly difficult for individual test takers: some test takers may be rank-ordered low on task A but high on task B, while for others the reverse is true, even though both tasks assumedly measure the same skill. This phenomenon – which in G theory analysis is understood as an interaction effect of test takers and tasks – constitutes a form of error affecting a test’s reliability (Cardinet, Johnson, & Pini, 2010; Shavelson & Webb, 1991), which is best countered by increasing the number of tasks.
Apart from the targeted skill, reading comprehension tests are different from writing and speaking assessments because they generally have a nested structure: comprehension questions are nested within tasks. Hence, it is possible that error variance is not only constituted by the score variance between tasks (between-group variance in ANOVA terms), but also by the score variance within tasks (within-group variance). Evidence for the importance of within-task variance comes from G theory analyses of medical knowledge assessments, which have nested structures comparable to those in reading tests (Gagnon et al., 2008; Norman et al., 2006): in both studies within-task variance was found to be the most important error source, while the contribution of between-task variance was very small or even absent. Gagnon et al. (2008) and Norman et al. (2006) also showed that, contrary to writing/speaking assessments, improvements in the reliability of these nested tests did not result from increasing the number of tasks, but from increasing the number of items per task (whereas the number of tasks could be decreased).
It is worthwhile examining whether the same mechanism holds for reading comprehension tests, not least because this would have an important practical advantage: since each extra task requires a test taker to read additional text, increasing the number of items per task while decreasing the number of tasks will likely result in shorter administration times.
Research questions
The aim of this study was twofold. First, we wanted to examine the assumption that reading comprehension subskills can be distinguished in low achievers. Second, we wanted to examine to what extent the reliability of the SALT-reading was affected by task specificity and, consequently, what the optimal task/item ratio is for obtaining a sufficient reliability level. These aims lead to the following research questions:
1. To what extent does the SALT-reading measure different subskills?
2A. To what extent is the reliability of the SALT-reading affected by task specificity?
2B. What is the optimal number of tasks and items within tasks to maximize test reliability within a fixed administration time?
Method
Participants
We administered the SALT-reading to 200 seventh graders from the lowest tracks of Dutch prevocational secondary education, the basic and middle-management programs (BP and MMP, respectively). This population is characterized by poor reading skills: PISA results revealed that, on average, ninth-grade students in these tracks only reach level 1 (BP) or 2 (MMP) of the PISA reading levels (De Knecht-Van Eekelen, Gille, & Van Rijn, 2007). The students were spread across 10 classes from nine schools. Of the students, 38% were girls and 62% were boys; 48% were of native Dutch families, 38% were of nonnative families, and 14% were of mixed families. On 1 January 2008, the year of administration, the students had a mean age of 12.79 years (SD = 0.66).
The SALT-reading
The SALT-reading was developed in an international study by the universities of Amsterdam and Utrecht (the Netherlands), the Ontario Institute for Studies in Education at the University of Toronto (Canada), and the Service de la Recherche en Education of the Canton of Geneva (Switzerland). This paper concerns the Dutch part of the project.
The test consists of a series of tasks, each comprising one or more texts followed by comprehension questions. Starting from the PISA definition of reading literacy, which acknowledges that reading involves understanding, using, and reflecting on written information for a variety of purposes and in a variety of situations (OECD, 2003), we included relatively many tasks. To ensure the test was sufficiently task-based, we strived for maximum coverage of different configurations of topic/content, genre, format, and medium. We additionally assumed that, given the possibility of the effects of task specificity, maximizing the number of tasks would increase the chance of reaching a sufficient reliability. Given practical limitations (we could only plan three 45-minute sessions), we included nine tasks, the maximum students were able to complete in the allotted time in a pilot study conducted beforehand.
We started by identifying media types students likely come across regularly: (school) books, newspapers and magazines, official documents, and the internet. Within these media we searched for texts about topics relevant to our target group, representing different combinations of genres and formats. To find texts with relevant topics we inspected sources we assumed to be familiar to adolescents: children’s books, educational materials for students in secondary education, newspapers and magazines that were readily available (neighborhood newsletters, TV guides) or written especially for our target group (youth magazines), official documents students are likely acquainted with (house rules), and websites students might visit for recreational or information seeking purposes. Regarding genre we made a distinction between narrative, argumentative, expository, and instructive texts. Regarding format we made a distinction between continuous texts and discontinuous texts (i.e. texts containing lists, tables, graphs). This led to an initial pool of 14 texts.
For each text we constructed items based on the distinction between lower, intermediate, and higher levels of understanding (‘Divisibility of subskills’ section), which we labeled ‘retrieving’, ‘interpreting’, and ‘reflecting’, respectively. Retrieving is the ability to locate relevant details in the text, interpreting is the ability to make inferences about shorter passages (e.g. identifying causal relationships between sentences), and reflecting is the ability to make inferences about larger passages or the text as a whole (e.g. articulating the main idea or the writer’s intention). We ensured every task included items from all three categories.
On the basis of the results of a pilot study among 550 students (grades 7–9 from the target population), to whom we administered subsets of tasks, we selected nine tasks. The selection was based on psychometric analyses and the outcomes of interviews with 40 of the students. We based our decision to include tasks on (i) the average difficulty levels of the tasks: tasks that were too easy or difficult were excluded to prevent ceiling or floor effects; (ii) the contribution of the items in a task to overall test reliability: if many items in a task had low item-total correlations (<.20), the task was omitted; and (iii) students’ judgments about the relevance and difficulty level of the texts and items. The latter was additionally investigated in a qualitative study among Dutch language teachers who were asked to judge the test (Daas, Havermans, & Van Noortwijk, 2009). The findings corroborated our assumption that the test tasks resemble those students have to be able to carry out in daily (academic) life. Table 1 provides an overview of the tasks.
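Selection criterion (ii) can be illustrated with a minimal sketch. The function names and the toy score matrix are hypothetical, and the sketch assumes an uncorrected item-total correlation (the item itself is included in the total); the text does not state which variant was used in the pilot analyses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length score lists.

    Assumes both lists have nonzero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def weak_items(scores, cutoff=0.20):
    """Return indices of items whose item-total correlation is below cutoff.

    scores: list of rows, one row of item scores per student.
    """
    totals = [sum(row) for row in scores]
    flagged = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        if pearson(item, totals) < cutoff:
            flagged.append(j)
    return flagged

# Hypothetical data: 6 students x 4 items (not taken from the study).
toy_scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(weak_items(toy_scores))
```

In this toy matrix only the last item falls below the .20 cutoff; a task containing many such items would be a candidate for omission.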
Overview of reading tasks
Note: T = task
We constructed 65 items: 21 retrieving, 26 interpreting, and 18 reflecting items. To establish interrater reliability, all items were also categorized by an independent rater: the agreement was satisfactory (r = .83). All nine tasks included multiple-choice items (57 items), which mostly provided four choices. Six tasks also included open-ended items (8 items). Both types covered retrieving, interpreting, and reflecting items. For each of the open-ended items, we wrote a model answer. Mostly, test takers could be awarded 0 (wrong) or 1 (right). In some cases, there was an in-between score: for one item, for example, test takers had to describe how the protagonist of a story felt and why. Test takers were awarded 0.5 if they provided an appropriate feeling and another 0.5 if they provided a valid argument. All answers were coded twice. The interrater agreement was substantial (mean Cohen’s kappa = 0.73).
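As an illustration of the agreement statistic reported for the double-coded answers, the following is a minimal sketch of Cohen’s kappa for two coders. The codes shown are hypothetical, and the sketch assumes that kappa is computed per item and then averaged, which the text suggests (‘mean Cohen’s kappa’) but does not spell out.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same set of answers."""
    n = len(rater_a)
    # Proportion of answers on which both raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes for one open-ended item (0, 0.5, or 1 point).
coder_1 = [1, 1, 0, 0.5, 1, 0, 0, 1, 0.5, 1]
coder_2 = [1, 1, 0, 0.5, 0, 0, 0, 1, 1, 1]
print(round(cohens_kappa(coder_1, coder_2), 2))
```

Here the coders agree on 8 of 10 answers, but kappa is lower than .80 because part of that agreement is expected by chance.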
Not all media were equally well represented: the test includes only one official document and only one internet text. The former is a result of the fact that official documents are restrictive in the genres they allow. The latter is an outcome of the pilot study: we tested two internet tasks, but for one most items made very limited contributions to overall test reliability and we excluded it.
Procedure
The test was administered to whole classes in three 45-minute sessions. We scheduled no more than two sessions per day to minimize test weariness. Eight tasks were paper-and-pencil assignments; the internet task was administered in a computer room. All sessions were introduced by a researcher or trained test leader. A familiar teacher was present to maintain order. Students’ questions were answered by the test leaders according to a standardized protocol.
Analyses
Identifying subskills
To examine whether subskills of reading comprehension could be identified, confirmatory factor analyses were performed using Structural Equation Modeling software (EQS 6.1; Bentler & Wu, 2002). We compared a first-order single-factor model assuming the test measures one underlying ability with a hierarchical second-order model assuming the test measures three dimensions that load on one general reading comprehension factor. Item parcels were entered as observed variables. Parcels are aggregated indicators comprised of, in this case, the sum of multiple items (Little, Cunningham, & Shahar, 2002). The parcels were compiled by means of (stratified) random sampling. First, the 65 items were classified on the basis of the three hypothesized subskills, resulting in three sets of items: 21 retrieving, 26 interpreting, and 18 reflecting items. Subsequently, three samples were randomly drawn from each of these three sets, resulting in a total of nine samples. The items from these samples were added to form nine parcels (RTR1-2-3 with seven items in each parcel; INT1-2-3 with nine items in the first two parcels and eight in the third; RFL1-2-3 with six items in each parcel). In order to use all available information, the EM procedure in EQS was used to deal with missing values.
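The parceling procedure can be sketched as follows. This is an illustration only: the item labels, the seed, and the round-robin split are assumptions, and only the item counts per subskill (21, 26, 18) and the parcel names are taken from the text.

```python
import random

def build_parcels(item_ids, n_parcels=3, seed=0):
    """Randomly split one subskill's items into n_parcels near-equal parcels."""
    rng = random.Random(seed)
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    # Deal items round-robin so that parcel sizes differ by at most one.
    return [shuffled[k::n_parcels] for k in range(n_parcels)]

# Hypothetical item labels; only the counts per subskill come from the text.
subskills = {
    "RTR": [f"rtr_{i}" for i in range(21)],
    "INT": [f"int_{i}" for i in range(26)],
    "RFL": [f"rfl_{i}" for i in range(18)],
}
parcels = {
    f"{name}{k + 1}": parcel
    for name, items in subskills.items()
    for k, parcel in enumerate(build_parcels(items))
}
print({name: len(items) for name, items in parcels.items()})
```

A student’s score on a parcel would then be the sum of his or her scores on the items in that parcel, and the nine parcel scores serve as the observed variables in the factor models.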
Several indicators were used to judge model fit (Dunn, Everitt, & Pickles, 1993; Hu & Bentler, 1999; Ullman, 2001): the χ2-test, which should be nonsignificant; the ratio χ2/df, which should be less than 2; the comparative fit index (CFI), which should be above .90; the standardized root mean square residual (SRMR), which should be below .08; and the root mean square error of approximation (RMSEA), which should be below .06. The χ2 difference test was used to compare both models (Ullman, 2001).
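The cutoffs listed above can be collected in a small helper. This is a sketch under stated assumptions: the function name and the example statistics are hypothetical, and a significance level of .05 is assumed for the χ2-test because the text does not name one.

```python
def fit_checks(chi2, df, p_value, cfi, srmr, rmsea, alpha=0.05):
    """Compare model fit statistics with the cutoffs listed in the text."""
    return {
        "chi2 nonsignificant": p_value > alpha,  # alpha = .05 is an assumption
        "chi2/df < 2": chi2 / df < 2,
        "CFI > .90": cfi > 0.90,
        "SRMR < .08": srmr < 0.08,
        "RMSEA < .06": rmsea < 0.06,
    }

# Hypothetical fit statistics (not the values reported in Table 3).
example = fit_checks(chi2=30.0, df=27, p_value=0.31, cfi=0.99, srmr=0.04, rmsea=0.02)
print(all(example.values()))
```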
Task specificity
We examined the possible effects of task specificity using G theory analysis, which allows researchers to disentangle the multiple error sources affecting test reliability and helps to explore ways of improving measurement precision (Brennan, 2001; Cardinet, Johnson, & Pini, 2010; Shavelson & Webb, 1991). G theory analysis involves two steps. In the first, the generalizability (G) study, the observed measurement is decomposed into different variance components (‘facets’). A distinction is made between differentiation, instrumentation, and generalization variance (Cardinet, Johnson, & Pini, 2010). Differentiation variance is similar to true score variance in classical test theory (CTT). Instrumentation variance is the variance caused by the measurement conditions, here the tasks in the test (T), the test items nested in the tasks (I:T), and the interactions between students and tasks (S×T), and between students and items in tasks (S×I:T). Generalization variance is similar to error variance in CTT, the part of the variance attributable to fluctuations arising from the random selection of the components of the measurement procedure as well as to purely random, unidentified error. Because in our case G theory analysis assumes relative measurement (the aim is to position students relative to one another in a score distribution), generalization variance is only constituted by the interaction terms – S×T and S×I:T – where the former constitutes the between-task error variance and the latter the within-task error variance. The interaction term S×I:T also includes the residual error (e). The results of the G study are summarized in the standard error of measurement and the generalizability coefficient, which, like the reliability coefficient in CTT, reflects the proportion of systematic variability in students’ scores (Shavelson & Webb, 1991).
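For reference, the relative error variance and the generalizability coefficient described above can be written out for a balanced design in which students are crossed with items nested in tasks. This is the standard textbook form for relative measurement; the symbols n_T (number of tasks) and n_I (number of items per task) are notation introduced here for illustration and do not appear in the source.

```latex
\sigma^2_{\delta} \;=\; \frac{\sigma^2_{S \times T}}{n_T} \;+\; \frac{\sigma^2_{S \times I:T,\,e}}{n_T\, n_I},
\qquad
E\rho^2 \;=\; \frac{\sigma^2_{S}}{\sigma^2_{S} + \sigma^2_{\delta}}
```

The first term of the relative error variance is the between-task error and shrinks only when tasks are added; the second is the within-task error and shrinks with the total number of items.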
The second step is the decision (D) study, which examines the effects of applying alternative measurement designs on minimizing error and maximizing reliability. In the case of multiple instrumentation facets, this could involve increasing the number of values (‘levels’) for those facets that contribute most to measurement error while decreasing the number of levels for facets with low impact on measurement error (Cardinet, Johnson, & Pini, 2010).
We used the software package EduG 6.0 (IRDP, 2010). The fact that the test constituted an unbalanced nested design – the tasks comprised unequal numbers of items (see Table 1) – required that we perform an additional operation before conducting the actual G theory analysis. Following Cardinet, Johnson, and Pini (2010), we computed sums of squares in which the tasks were weighted according to the numbers of items they comprised. These sums of squares were entered into EduG.
Results
Descriptive statistics
Table 2 presents the descriptive statistics for the test as a whole, the subskill parcels used in the SEM analyses, and the task scores used in the G theory analyses. Of the 200 students, 163 completed all parts of the test. The EM procedure was applied so that the scores of the other 37 students could be used as well (see ‘Identifying subskills’ subsection). Therefore, we also present the numbers of students that completed the items in each of the subskill parcels and each of the nine tasks.
Descriptive statistics
Note: RTR = retrieving; INT = interpreting; RFL = reflecting; T = task
The overall mean shows that the test is not too difficult for the students in our sample, neither is it too easy: on average, about two thirds of their answers were correct. There is some variation between the tasks in mean percentage of correct scores (51%–79%). Yuan, Lambert, and Fouladi’s (2004) normalized multivariate kurtosis estimate proved to be nonsignificant for both the subskill parcels (0.40) and the task scores (−1.82), which implies that the multivariate normality assumption was not violated. For the subskill parcels this meant that Maximum Likelihood estimation could be used for the SEM analyses. In only two cases were the univariate kurtosis estimates statistically significant (T3 and T5). The Cronbach’s alpha reliability coefficient was .80.
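For readers less familiar with the reliability coefficient reported here, a minimal sketch of Cronbach’s alpha is given below. The data are hypothetical, and the sketch assumes a complete students-by-items score matrix; it does not reproduce how missing responses were handled in the study.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a complete students x items matrix of scores."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 students x 3 items (not taken from the study).
toy_items = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
]
print(round(cronbach_alpha(toy_items), 2))
```

Note that alpha treats all items as interchangeable and therefore ignores the nesting of items within tasks.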
Identifying subskills
The two confirmatory factor models are presented in Figures 1 and 2. The model fit results and the outcome of the χ2 difference test are given in Table 3. Note that in the second-order model the error variances (indicated by D for ‘Disturbance’) associated with the first-order factors were constrained to be equal for the purpose of solving parameter conditions encountered during optimization.

Subskills: Single-factor model

Subskills: second-order factor model with three dimensions
Model fit results
Both models have a good fit (the χ2-values are not significant, the χ2/df ratios are below 2, the CFIs are above .90, and the SRMRs and RMSEAs are below .08 and .06, respectively). However, the χ2 difference test shows that the second-order model is no significant improvement on the more parsimonious, single-factor model. Thus, the data do not provide evidence for the presence of subskills, but support the assumption that the test is measuring only one underlying ability. The validity of the single-factor solution is further corroborated by the fact that in the second-order model the paths between the second- and first-order factors have very high regression weights (.94−.99).
Task specificity
The complete results of the G study are presented in Table 4. The table includes both the relative and absolute error variances, but given our assumption of relative measurement, we will only discuss the former.
G study outcomes
Note: S = Students; T = Tasks; I:T = Items in Tasks; e = (residual) error
Three variance estimates are of importance. The first is the variance estimate of the student facet S (0.01230), which concerns the differentiation or true score variance. The second and third are the relative error variances associated with the interaction between students and tasks (S×T) and the interaction between students and items in tasks plus residual error (S×I:T, e), concerning the between-task and within-task error variance, respectively. The variance estimates for these interactions (column 4) show that the generalization or error variance (0.00443) was completely caused by S×I:T, e. This first of all suggests that the task effects described in the ‘Task specificity’ section did not occur: score variability was not caused by individual test takers scoring differently from one task to the other. It further implies that test score variability was mainly determined by individual test takers scoring differently from one item to the other and by other, undifferentiated sources of error (these two error sources cannot be disentangled in G theory analysis). Finally, the fact that the differentiation variance (0.01230) was nearly three times larger than the generalization variance (indicated by a G coefficient of 0.74) implies that a substantial proportion of the total score variance can be attributed to true score variance, that is, to differences in students’ actual reading comprehension ability. The G coefficient is somewhat lower than the Cronbach’s alpha coefficient reported earlier (0.80). This is likely because the G coefficient takes the nested design into account (Lee, 1999).
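The G coefficient quoted above follows directly from the two variance estimates given in the text. The short check below uses only those two reported values; the variable names are chosen here for illustration.

```python
var_students = 0.01230   # differentiation (true score) variance, from the text
var_rel_error = 0.00443  # relative error (generalization) variance, from the text

# Proportion of relative-measurement variance attributable to students.
g_coefficient = var_students / (var_students + var_rel_error)
# How many times larger the differentiation variance is than the error variance.
ratio = var_students / var_rel_error

print(round(g_coefficient, 2), round(ratio, 2))
```

The ratio of about 2.8 corresponds to the ‘nearly three times larger’ in the text, and the coefficient rounds to the reported 0.74.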
Subsequently, a D study was performed. Since the G study showed that S×I:T, e fully explained generalization variance, the obvious way of maximizing the test’s reliability is by increasing the number of items per task (as a consequence of which the number of tasks can be decreased). Assuming that three tasks could be comfortably fitted within one test session of 45 minutes, we made predictions of how many items would be needed per task to reach a satisfactory G coefficient of .80, when the number of tasks was decreased from nine to three and six (implying one and two test sessions, respectively). To examine the other possible solution to improving reliability (increasing the number of tasks, while decreasing the number of items per task), we also included the predicted G coefficients when the number of tasks was increased to 12 (implying four sessions). Finally, we computed the number of items needed per task when the current number of tasks (nine) remained the same. The results are in Figure 3.

D study results for 3, 6, 9, and 12 tasks and 5–21 items per task
Figure 3 shows that in the case of three tasks 21 items per task would be needed to obtain a G coefficient of .80, which far exceeds the number of questions one can reasonably pose about texts such as those included in the test. Increasing the number of tasks to six would imply 11 items per task, which is two items more than the maximum number of items per task in our current test (nine) and six items more than the minimum number (five). If we were to keep the same number of tasks as in the current test (nine), a minimum of seven items per task would be needed to reach a satisfactory G coefficient. Increasing the number of tasks to 12 would imply a minimum of six items per task.
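The logic of such a D study can be sketched for a balanced design. The variance components below are hypothetical single-observation values chosen for illustration, not the Table 4 estimates, and the sketch ignores the weighting used for the unbalanced design, so it is not expected to reproduce the numbers in Figure 3. The function and variable names are likewise assumptions.

```python
def g_coef(var_s, var_st, var_sit_e, n_tasks, n_items):
    """Relative G coefficient for a balanced students x (items:tasks) design."""
    rel_error = var_st / n_tasks + var_sit_e / (n_tasks * n_items)
    return var_s / (var_s + rel_error)

def min_items_per_task(var_s, var_st, var_sit_e, n_tasks, target=0.80, max_items=200):
    """Smallest number of items per task that reaches the target, or None."""
    for n_items in range(1, max_items + 1):
        if g_coef(var_s, var_st, var_sit_e, n_tasks, n_items) >= target:
            return n_items
    return None

# Hypothetical variance components (NOT the estimates reported in Table 4).
VAR_S, VAR_ST, VAR_SIT_E = 0.012, 0.0, 0.20

for n_tasks in (3, 6, 9, 12):
    print(n_tasks, min_items_per_task(VAR_S, VAR_ST, VAR_SIT_E, n_tasks))
```

With a between-task component of zero, as in this sketch, only the total number of items affects the coefficient, which is why fewer tasks must be compensated by proportionally more items per task.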
Discussion
We examined two issues relevant to the field of reading test development by analyzing the outcomes of a new test designed to assess the reading comprehension ability of low achievers in secondary education. The first issue concerned the identification of reading comprehension subskills. Although our assumption was that such subskills are more readily identified in low achievers, SEM analyses provided no evidence for dimensionality: a second-order model in which reading comprehension was assumed to consist of three dimensions (retrieving, interpreting, reflecting) proved no significant improvement on a single-factor model. This outcome is in line with the results of studies that failed to distinguish reading comprehension subskills (e.g. Meijer & Van Gelderen, 2002; Schedl et al., 1996) and could be seen as supporting the claim that reading comprehension is a single, unitary ability, even for low-achieving readers (Rost, 1993). Further explanation of this observation can be sought in cognitive models of reading comprehension. Kintsch’s (1998) construction-integration model, for instance, makes clear that, during reading, several different processes are at work. Readers process local, micro-level information in the form of propositions, they connect these propositions to background knowledge, and to preceding and subsequent micro-level information in order to establish local cohesion, and they use this micro-level information in a constant process of macrostructure formation. The evolving macrostructure, in turn, guides micro-level processing, particularly by focusing the reader’s attention on elements relevant to the gist of the text. The fact that these processes are mutually dependent may have caused the unidimensionality observed in our analyses.
The question remains why other researchers, using similar analytical techniques, were able to identify subskills. In reviews of studies on this topic, Khalifa and Weir (2009) and Weir and Porter (1994) suggest that the ability to separate subskills is affected by both test and test taker characteristics. One possibility is that identifying dimensionality is hampered by a test being too easy: Song (2008), for instance, suggests that she found evidence for only two out of three assumed subskills because the students in her study performed relatively well. Although the students in our sample were low achievers, the test was not overly difficult for them (nor was it disproportionately easy). Following Song’s suggestion, a more challenging test could have increased the chance of exposing the postulated dimensions. The question remains, however, where in this respect the boundary between a sufficiently and an overly challenging test lies: after all, if a test is too difficult, limited variability in item scores due to floor effects will make it hard to find relationships among items.
Additionally, our sample consisted mostly of native speakers of Dutch and L2 students who were born in the Netherlands and had learned to read and write in Dutch from an early age. Had the sample included more beginning L2 readers, a different pattern might have emerged, as Weir and Porter (1994) suggest.
Finally, item format might have played a part. Our test comprised mainly multiple-choice items, and it has been shown that such items affect the reading process. In a small-scale study, Rupp, Ferne, and Choi (2006) showed that multiple-choice questions (even those aiming to assess higher-level skills) can trigger test takers to divide a text into chunks aligned with individual questions. Consequently, the test taker focuses on the microstructure rather than the macrostructure, thereby automatically limiting his/her use of higher-order inferences. Other answer formats might have yielded other results.
The second issue analyzed in this study is the extent to which test reliability is influenced by task specificity, that is, the situation that the tasks within a test are not uniformly difficult for individual test takers. A G study revealed that this interaction of test takers and tasks did not contribute to generalization (or error) variance: in other words, the hypothesized task-specific effects did not occur. Instead, generalization variance was fully explained by the interaction between test takers and items in tasks plus residual error, which parallels the outcomes of G theory analyses in other types of task-based assessments (Gagnon et al., 2008; Norman et al., 2006). As a consequence, increasing the number of items per task (instead of increasing the number of tasks) was the most efficacious way of maximizing the test’s reliability. This procedure has important limitations, however. In theory, reliability could be guaranteed by a test consisting of a single task including many items, but this would result in both practical and validity problems. First of all, there are only so many questions one can reasonably pose about texts such as the ones included in our test: a single-task test would require a long text for a test developer to devise a sufficient number of items. Moreover, for a test to be content valid it should have sufficient coverage of relevant characteristics of the domain of interest (Alderson, 2000): a single-task test will normally not be enough to provide such coverage. Given these limitations as well as time considerations (three 45-minute sessions seemed to be the maximum we could ask from students and schools), keeping the current number of tasks (nine) while ensuring a minimum of seven items per task seemed the optimal solution. In other situations (and countries) administration times might be even more limited. We believe our study gives clear indications for coping with such limitations.
What do these conclusions imply for the validity of the SALT-reading? We started this paper by referring to Bachman’s (2002) claim that a test should be both construct-based and task-based. If we assume that subskills of reading comprehension exist, we must conclude that our test was not able to identify them. However, if we consider reading comprehension to be a unitary construct, the unidimensionality of the test scores – as indicated by the good fit of the single-factor model – can be seen as being strongly supportive of the validity of the test. To ensure our test was task-based, we decided to include a relatively large and diverse set of reading tasks relevant to our target population, and we assumed that this would lead to sufficient reliability. While Cronbach’s alpha had a satisfactory value of .80, the G coefficient – which is theoretically more appropriate for nested test designs (Lee & Frisbie, 1999) – was .74. Although this suffices for basic research purposes (Nunnally & Bernstein, 1994), a reliability of at least .80 is desirable. As we have shown, this can be obtained within the practical limitations of test administration by increasing the number of items per task.
Finally, what do our conclusions imply for validity research into other, similar tests? First, we recommend that in the future test developers and researchers either be careful in their claims that reading tests can be used to diagnose specific reading problems or validate these claims by using qualitative studies to confirm that particular categories of test items trigger particular types of reading behavior (Khalifa & Weir, 2009). These authors suggest, for example, incorporating think-aloud studies in the development process to examine whether test items elicit the assumed underlying processes (they call this the ‘cognitive validity’ of a test). We also recommend that, in addition to content validity considerations, outcomes of G theory analyses be used for deciding on the number of tasks and items to be included in reading tests. G theory analyses serve as a useful tool for reading test developers in determining the optimal balance between tasks and items, particularly because they allow taking the nested structure of such tests into account.
Acknowledgements
The project was funded by the Netherlands Organization for Scientific Research (NWO; project number 411-06-504). We thank the schools and students participating in the study as well as Suzanne Jak, Rudy Ligtvoet, and Frans Oort for their statistical advice and Jean Cardinet for his advice on EduG.
