Abstract
“Vocabulary and structural knowledge” (Grabe, 1991, p. 379) appears to be a key component of reading ability. However, is this component to be taken as a unitary one or is structural knowledge a separate factor that can therefore also be tested in isolation in, say, a test of syntax? If syntax can be singled out (e.g. in order to investigate its contribution to reading ability), this test of syntactic knowledge would require validation. The usefulness and reliability of using expert judgments as a means of analysing the content or difficulty of test items in language assessment have been questioned for more than two decades. Still, groups of expert judges are often called upon as they are perceived to be the only or at least a very convenient way of establishing key features of items. Such judgments, however, are particularly opaque and thus problematic when judges are required to make categorizations where categories are only vaguely defined or are ontologically questionable in themselves. This is, for example, the case when judges are asked to classify the content of test items based on a distinction between lexis and syntax, a dichotomy corpus linguistics has suggested cannot be maintained. The present paper scrutinizes a study by Shiotsu (2010) that employed expert judgments, on the basis of which claims were made about the relative significance of the components ‘syntactic knowledge’ and ‘vocabulary knowledge’ in reading in a second language. By both replicating and partially replicating Shiotsu’s (2010) content analysis study, the paper problematizes not only the issue of the use of expert judgments, but, more importantly, their usefulness in distinguishing between construct components that might, in fact, be difficult to distinguish anyway. This is particularly important for an understanding and diagnosis of learners’ strengths and weaknesses in reading in a second language.
Alderson (1993a) argues that the use of so-called experts to judge the content or difficulty of test items is highly questionable as these judgments are often of limited accuracy, reliability and validity. Bachman et al. (1996) and Alderson et al. (2012) have also demonstrated that judgments about salient item characteristics appear rather obscure and arbitrary and that agreement between judges is moderate at best. Nevertheless, language testers still frequently rely on expert judgments when attempting to validate the content or construct of a particular test. In a study investigating the relative significance of different components of second language (L2) reading, Shiotsu (2010), as part of a preliminary study, employed expert judges to justify a test of 35 items as a valid and suitable measure of syntactic knowledge. Based on these judgments, Shiotsu (2010) then decided which items should be included in the test of syntactic knowledge and thus form the basis of further statistical analyses, their results and inferred claims about the nature of L2 reading.
Given Alderson’s (1993a) caution against the use of expert judgments and other criticisms that have been voiced concerning the study in question (Brunfaut, 2009), Shiotsu’s preliminary study should be scrutinized in more detail, analysing closely the logic and rationale behind the inclusion or exclusion of certain items and replicating the study to see whether findings can be corroborated with different groups of experts. The aim of the present paper is thus twofold: It will problematize the use of expert judgments for content validation in general, but will also discuss the particular difficulty when the judges’ task is to make a clear construct distinction between syntactic and lexico-semantic knowledge, two categories that, by nature, overlap and have blurred boundaries.
This paper will present the findings of an examination of Shiotsu’s (2010) content analysis study. The paper briefly outlines the context of the study and Shiotsu’s (2010) findings in the original study in order to contextualize and facilitate interpretation of the insights gained from the present study. It then presents the results of replications of this study and discusses findings from using an alternative approach to judgment gathering to investigate whether Shiotsu’s (2010) test can be confirmed as an instrument that mainly measures syntactic knowledge and therefore has yielded trustworthy results that form the basis of claims about the relative significance of syntactic knowledge in L2 reading ability. Finally, the implications of the study for our understanding of L2 reading and for future research are discussed.
Context
Adopting a component model of reading rather than focusing on the cognitive processing, numerous researchers in the past have attempted to model L2 reading ability and explain the relative contribution of different components to reading, or rather reading test performance. Amongst these, “vocabulary and structural knowledge” (Grabe, 1991, p. 379) appears to be one of the most prominent components according to research. Nevertheless, it is generally agreed that vocabulary knowledge best predicts reading test performance.
Alderson (2000) states that “factor analytic studies of reading have consistently found a word knowledge factor on which vocabulary tests load highly” (p. 99) and that therefore vocabulary knowledge is an important predictor of variance in reading test performances (Qian, 2002). Baddeley et al. (1985), Dixon et al. (1988), Cunningham et al. (1990), Beck and McKeown (1991) and Daneman (1991) identified vocabulary as an important component of fluent L1 reading. Hacquebord (1989), Bossers (1989), Laufer (1992) and Schoonen et al. (1998) report similar findings for the L2 context. Yamashita (1999) also claims that L2 vocabulary knowledge surpasses L2 grammar knowledge in explaining L2 reading variance. Brisbois (1995), using grammar and vocabulary as independent predictor variables in her analysis, found that vocabulary measures showed a higher correlation with reading scores than did the grammar measure.
However, a recent paper by Shiotsu and Weir (2007) criticizes the methodological bias and shortcomings in previous studies and concludes that “the literature on the relative contribution of the grammar and vocabulary knowledge to reading performance is too limited to offer convincing evidence for supporting one or the other of the two predictors” (Shiotsu & Weir, 2007, p. 105).
Instead, Shiotsu and Weir claim that “the role of vocabulary appears somewhat overstated while that of grammar understated” (p. 104), which would be in accordance with studies by Alderson (1993b) and Bachman et al. (1989) which found that grammar tests do explain a substantial percentage of variance in reading test performances. Kaivanpanah and Zandi (2009) concluded from their findings that “syntactic behavior is more related to reading comprehension than vocabulary knowledge” with students’ scores on the TOEFL grammar test outperforming their scores on Qian and Schedl’s (2004) Depth of Vocabulary Knowledge Test as a predictor of reading test performance.
While it is beyond the remit of this paper to examine the methodology of all of these studies which appear to support Shiotsu’s (2010) and Shiotsu and Weir’s (2007) claim, the present paper will scrutinize the methodology and results of Shiotsu’s (2010) preliminary study, which underpins the claim that syntactic knowledge is a better predictor of L2 reading test performance than vocabulary knowledge. Only if Shiotsu’s (2010) test can be confirmed as measuring mainly, or exclusively, syntactic knowledge can the results it has produced be regarded as a reliable basis for any claims regarding the construct of L2 reading ability.
The original study
Shiotsu (2010), in an investigation of the predictive power of a range of components of L2 reading, employed expert judges in a content analysis to legitimize a test of 35 items as a valid measure of syntactic knowledge. Shiotsu’s expert group consisted of “11 L1-English ELT experts with at least a master’s degree in applied or theoretical linguistics or TEFL” (Shiotsu, 2010, p. 62) and three Japanese lecturers of English syntax with at least a master’s degree in linguistics. Based on an adaptation of Bachman et al.’s rating scale for content analysis (1995), these experts were asked to evaluate the content of 35 discrete multiple-choice items. These items were collated from 15 past TOEFL (Test of English as a Foreign Language) Structure and Written Expression questions and 20 past TEEP (Test of English for Educational Purposes) grammar items. Judges were given this new test and were asked to indicate against each item whether they thought it was mainly testing (a) syntactic knowledge, (b) lexico-semantic knowledge, or (c) sentence comprehension (Shiotsu, 2010). Lexico-semantic knowledge was loosely defined as “knowledge of the meanings of certain words and phrases” (Shiotsu, 2010, p. 61), syntactic knowledge as “knowledge of sentence structures and that of acceptable sequences and forms of words in terms of syntax” (p. 61) and sentence comprehension as “understanding of the meaning of the overall sentence” (p. 61). No prototypical examples of the individual categories were provided to the judges. A summary of the judges’ responses is displayed in Table 1, where category A stands for syntactic knowledge, B for lexico-semantic knowledge, and C for sentence comprehension.
Table 1. Results of Shiotsu’s original content analysis study (Shiotsu, 2010, p. 63).
Of the total 483 ratings, 331 were for syntactic knowledge (68.5%), indicating that this 35-item test tends to be one of syntactic knowledge overall. Shiotsu’s rationale for the inclusion of items in the final test is that any item for which the syntax category did not receive the highest number of votes should be excluded from the test as it appears to be testing something other than syntactic knowledge. For this reason, items 12, 18 and 21 were eliminated in the main study on the basis of these findings from the preliminary study (items highlighted in grey).
However, one could argue that this logic is hardly convincing and that only items for which the majority of judges (in this case at least eight) indicated that it is testing mainly syntactic knowledge should be legitimately included in the syntactic knowledge measure. This would mean that another five items would need to be eliminated from further analyses (items hatched). One could thus claim that for items 4, 6, 14 and 31, the judges were clearly undecided, while for item 10 the majority of judges thought that it is testing something other than syntactic knowledge. Importantly, therefore, it remains to be seen whether scores from a syntax test consisting of the 28 remaining items would have yielded findings similar to Shiotsu’s results and claims from his main study.
To investigate this further, five replications of the original study were conducted, three times with different groups of judges using the same methodology as the original study, twice using a slightly modified methodological approach. The results and implications of these studies will be presented in the following.
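The two inclusion rules contrasted in this section can be made concrete with a minimal Python sketch. The vote counts below are invented for illustration and are not taken from Table 1. The names `ItemVotes`, `keep_by_plurality` and `keep_by_majority` are hypothetical, and treating a tie for first place as still being "the highest number of votes" is an assumption, since the original rule does not say how ties are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVotes:
    """Number of judges who placed one item in each category."""
    syntax: int                  # category A
    lexico_semantic: int         # category B
    sentence_comprehension: int  # category C


def keep_by_plurality(votes: ItemVotes) -> bool:
    """Shiotsu's rule: keep the item if syntax received the highest vote count.

    Assumption: a tie for first place still counts as 'highest'.
    """
    return votes.syntax >= max(votes.lexico_semantic, votes.sentence_comprehension)


def keep_by_majority(votes: ItemVotes, n_judges: int = 14) -> bool:
    """Alternative rule: keep the item only if more than half of all judges chose syntax."""
    return votes.syntax > n_judges / 2


# Hypothetical vote counts for three made-up items (14 judges each).
example_items = {
    "clear syntax item": ItemVotes(12, 1, 1),
    "undecided item": ItemVotes(7, 4, 3),
    "non-syntax item": ItemVotes(4, 8, 2),
}

# For each item: (kept under plurality rule, kept under majority rule).
decisions = {
    name: (keep_by_plurality(v), keep_by_majority(v))
    for name, v in example_items.items()
}

for name, (plurality, majority) in decisions.items():
    print(f"{name}: plurality rule keeps={plurality}, majority rule keeps={majority}")
```

The "undecided item" shows where the two rules diverge: syntax is the most frequent choice, yet only seven of 14 judges selected it, so it survives the plurality rule but fails the majority rule.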
First replication study
In order to subject the findings of Shiotsu’s preliminary study to closer inspection, his content analysis study using expert judges was replicated. Twenty-one international participants in a workshop on diagnostic reading assessment, all involved in language test development at a national or institutional level, of whom 15 were also teaching at a university level, were asked to judge each of the 35 individual items as to whether they thought it was mainly testing (a) syntactic knowledge, (b) lexico-semantic knowledge, or (c) sentence comprehension.
The judgment procedure of the original study was replicated with no modifications so as to highlight the potentially problematic nature of the distinction between the categories offered by Shiotsu. Although Weir, Hughes and Porter (1990) as well as Lumley (1993) maintain that inter-judge agreement could be increased by training the experts, this procedure was deliberately avoided in all replication studies as the authors did not want to clone judges through training (Alderson, 2000) but to establish the problematic nature of the construct. Alderson (2000) further states that a forced agreement through discussion in such a study would only show “the success of the cloning process” (p. 96) and would not provide an unbiased picture of what experts actually thought the items were testing. Therefore, just as in the reference study, no training, discussion or category modification was offered to the judges. The results of this study can be found in Table 2.
Table 2. Results of the first replication study.
The replication study confirmed the legitimacy of Shiotsu’s exclusion of items 12, 18 and 21 from the test. While the group of judges also found that this was in principle a test of syntax (437 out of 735 ratings included the syntax category), nine items clearly emerged as problematic (items 4, 10, 12, 14, 18, 21, 25, 28 and 31), even if Shiotsu’s questionable rationale for item exclusion was applied. In addition, item 33 does not show a clear tendency or majority vote in terms of its categorization. This replication study therefore indicates that items 10, 14, 25, 28, 31 and 33, which were all included in Shiotsu’s original test, cannot be unambiguously justified in this test of syntactic knowledge.
Second replication study
An international group of 19 language testing experts, all with at least a master’s degree in linguistics or applied linguistics, was identified as potential judges for a second replication study. Employing the same methodology as in the original and the first replication study, the test was sent out via email to the experts, 16 of whom returned the completed content analysis to the researchers. Of these 16 responses, two had to be removed from the analysis as it emerged from the judges’ responses and comments that they had not adhered to the instructions provided. The 14 remaining judges all had at least five years of professional experience in language testing; nine of them were L1 English speakers and eight of the 14 judges held a PhD in linguistics, applied linguistics or language testing. However, as Weir, Hughes and Porter (1990) state, “experience does not guarantee reliability of judgment” (p. 507) and it has yet to be shown which kind of expert group makes the best judges of item content. The number of judges in the second replication study was thus identical to the number of judges in the original study (N = 14). For this reason and also because different groups of expert judges came up with different results, the results of the individual studies were not collated but are reported separately in what follows. A summary of the results from the second replication study can be found in Table 3.
Table 3. Results of the second replication study.
Of the total 497 ratings, 329 included the category syntactic knowledge (66.2%), confirming Shiotsu’s finding that this 35-item test tends to be one of syntactic knowledge overall. In addition, Shiotsu’s elimination of items 12, 18 and 21 appears to be justified as they were judged to be testing something other than syntactic knowledge.
However, applying Shiotsu’s rationale for excluding items if the number of votes for the syntax category is not the highest of the three categories, three more items would have to be eliminated from further analyses on the basis of the results of this replication study. Items 9, 14 and 25 emerge as being non-syntax items from the second replication study judgments.
Applying the alternative principle of a majority decision (eight judges or more) necessary for the inclusion of an item in the syntax test, another four items would not meet this criterion and would have to be eliminated. The summary of judgments for items 10, 28, 31 and 33 does not provide conclusive evidence that these are syntax items. It is therefore questionable whether these items are worth including in a measure of syntactic knowledge. This finding seems not only to buttress Alderson’s (1993a) above-mentioned caution but also to relativize and undermine Shiotsu’s results and the claims of the main study.
Third replication study
The results of the second replication study were then compared with another study by Alderson (2011) who also attempted to replicate Shiotsu’s original content analysis study. The results of this study, in which 14 university professors of applied linguistics participated, are shown in Table 4.
Table 4. Results of the third replication study.
Table 4 probably best illustrates the insecurity of judges with this categorization task. Again, the findings seem to suggest that this 35-item test is in general one of syntactic knowledge, albeit not as convincingly as the results of the studies already discussed. Of the 490 ratings, 294 included the syntax category (60%). As in both the original study and the second replication study, the judgments for items 12, 18 and 21 clearly suggest that they should be excluded from this test of syntax as they are tapping into a different reading component.
However, as in the second replication study, ratings for item 14 would also suggest that, adhering to Shiotsu’s logic, this item should be eliminated as it is not clearly and convincingly an item testing syntactic knowledge. Items 9 and 25, identified as possible drop-outs in the second replication study, would not have to be eliminated on the basis of the judgments of the third replication study if Shiotsu’s principle for inclusion was applied. However, neither of these items seems to be a clear-cut syntax item, as Table 4 suggests. Applying the alternative principle of majority decisions to the judgments, these two (items 9 and 25) and another eight items (items 2, 4, 10, 11, 26, 28, 31 and 33) would emerge as problematic, four of which (items 10, 28, 31 and 33) were also identified as problematic in the second replication study.
Fourth replication study – approximate replication with alternative rating approach
In a further study, the same test was sent out to a different group of expert judges, this time with a different rating grid. Since the two categories ‘syntactic knowledge’ and ‘lexico-semantic knowledge’ appeared to be predominantly used in the earlier replication studies, only these two categories were selected for study. Nineteen judges, comparable in expertise to those in the second replication study, were asked via email to indicate against each of the items what they thought each item was mainly testing, grading their judgment on a 6-point Likert scale ranging from (1) ‘mainly syntactic knowledge’ to (6) ‘mainly lexico-semantic knowledge’. Eleven completed content analyses were returned to the researchers. One judge had to be removed from the file as the ratings and comments suggested that the instructions provided had not been adequately followed. The summary of the Likert scale judgments (mean scores) is shown in Table 5.
Table 5. Results of the fourth replication study.
Since the rating scale ranged from 1 to 6, 3.5 was chosen as the cut score. An item showing a mean rating higher than 3.5 indicates that it tends to the lexico-semantic end of the continuum and thus suggests that it should not be included in a syntactic knowledge measure. Again, the results suggest that items 12 and 18 should be removed from the test as they are testing lexico-semantic knowledge rather than syntactic knowledge. Item 21, identified as a drop-out in the other investigations, does not emerge as problematic from these results. This, however, may be due to the fact that it was classified as a ‘sentence comprehension’ item in the other studies, for which there was no category available in this rating procedure.
However, item 14, identified as a potential drop-out in both the first and the second replication study and not convincingly justified as a syntax item in the original study, clearly emerges as a problematic item with a mean rating of 4.75. This strongly suggests that this item should be removed from the syntactic knowledge measure. Items 9, 25, 31 and 33, highlighted as potentially problematic in both the second and the third replication study, also have values well above the cut-score and might thus not be justifiable as items testing syntactic knowledge. The previous findings for item 28, which suggested that this item might also be removed from the syntactic knowledge measure, could not be confirmed in the fourth replication study. For item 2 the results of the third replication study, in which this item was identified as a potential drop-out, could be confirmed.
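The cut-score rule used in this study can be illustrated with a short Python sketch. The ratings below are invented for three made-up items and are not the data behind Table 5; the function name `flag_non_syntax` is hypothetical. The sketch assumes, as described above, that only means strictly above the scale midpoint of 3.5 are flagged.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 6              # 1 = mainly syntactic, 6 = mainly lexico-semantic
CUT_SCORE = (SCALE_MIN + SCALE_MAX) / 2  # midpoint of the scale, 3.5


def flag_non_syntax(ratings_by_item, cut_score=CUT_SCORE):
    """Return {item: mean rating} for items whose mean lies above the cut score."""
    means = {item: mean(ratings) for item, ratings in ratings_by_item.items()}
    return {item: m for item, m in means.items() if m > cut_score}


# Hypothetical ratings from ten judges for three made-up items.
example_ratings = {
    "item X": [1, 2, 1, 2, 3, 1, 2, 2, 1, 2],  # mean 1.7, syntactic end
    "item Y": [5, 6, 4, 5, 5, 4, 6, 5, 4, 5],  # mean 4.9, lexico-semantic end
    "item Z": [3, 4, 3, 4, 4, 3, 4, 3, 3, 4],  # mean 3.5, exactly at the cut score
}

flagged = flag_non_syntax(example_ratings)
print(flagged)
```

An item whose mean sits exactly on the midpoint, like "item Z" here, is not flagged under this reading of the rule, although such an item is hardly a clear syntax item either.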
Fifth replication study
In a fifth replication study, the same group of judges as in replication study 1 were asked to judge the content of the test items again, this time classifying items according to the rating grid of replication study 4. The results are displayed in Table 6.
Table 6. Results of the fifth replication study.
Again, the cut-score was set at 3.5, as in the fourth replication study, so that an item with a mean rating higher than 3.5 was taken to have been judged by the group as testing lexico-semantic knowledge rather than syntactic knowledge. Interestingly, three fewer items than in the fourth replication study were found to be problematic. Items 2, 9 and 33 were rated by this group as tending to test syntactic knowledge. This echoes the judgments of this group on these items using the original classification grid. Also, item 10, identified clearly by the group as a lexico-semantic item in the first replication study, is just below the stipulated cut-score with a mean rating of 3.48.
Overview of results
Table 7 shows which items were identified as potential candidates for exclusion from the test of syntactic knowledge in question by the five replication studies outlined above. ‘XX’ marks items that should be excluded according to the original rationale of Shiotsu, that is, if the syntax category did not receive the highest number of votes. ‘X’ marks items that should be excluded according to an alternative principle, that is, if the syntax category did not receive the majority of votes. ‘?’ marks problematic items identified in studies four and five, which employed a slightly different methodology and thus also a different rationale for item exclusion.
Table 7. Overview of results.
Contrary to Shiotsu’s study, which only yielded three problematic items, the five replication studies, conducted using methods of gathering expert judgments in content analyses that were identical or similar to those in the original study, found that only 20 items out of 35 emerged as unproblematic and clear syntax items from all studies. Fifteen items in total were shown to require further scrutiny as they could clearly be questioned: items 2, 4, 6, 9, 10, 11, 12, 14, 18, 21, 25, 26, 28, 31 and 33 (see Appendix).
Three items (14, 25 and 31), which were not excluded in the original study, were identified as problematic by all five replication studies. As all five replication studies suggest that these three items cannot justifiably be maintained as syntax items, the legitimacy of the test in question and all results and claims based on it are questioned. An exclusion of at least these three items (in addition to the three originally excluded items) and a subsequent re-run of the original analysis, as well as a comparison of results against a re-analysis using the 20 remaining syntactic items, appears necessary as different findings might result.
How many judges are sufficient?
An important issue to address when using expert judgments is how many judges are needed to reliably classify items. Obviously, the larger the sample of judgments and the more agreement there is between judges the more accurate the classifications will be. With binary judgment responses, such as ‘does this item test this construct yes/no’, the properties of the binomial distribution can be used to allow the application of inferential statistics to the issue, via one-tailed lower-bound confidence intervals.
Due to the relatively small sample sizes inherent in the collection of judgment data, the score confidence interval (Wilson, 1927) should be used in preference to the more usual Wald interval, in line with the recommendations of Agresti and Coull (1998). Table 8 gives the approximate lower bound of the one-tailed score confidence interval (α = 0.05) for various numbers of judges at various levels of agreement. In order to use the table to assess the levels of agreement we have, we need to select a cut-off value for the minimum amount of agreement acceptable. In line with the studies in this paper, a value of 50% minimum agreement has been chosen. In order to accept, at the 0.05 significance level (i.e. with 95% confidence), that an item has been reliably classified as having at least 50% agreement, the value in the cell for the lower bound of the confidence interval must be 50% or greater.
Table 8. One-tailed lower 0.05 score confidence interval.
For example, if six out of eight judges (75%) agreed that an item tested a specific construct, a value of ~41% would be read from Table 8 (down the 70%+ column and across the eight judges row). Thus, we could not say, at the 0.05 significance level, that if we collected more data the value for agreement would not drop below 50%. So either more data must be collected or the judges should not be considered to be providing sufficient evidence that the item is testing the specific construct, given the cut-off of 50%. However, if seven out of eight judges (88%) agreed, then a value of ~51% would be read and we could accept the evidence, with a 95% level of confidence, that the item tested the specific construct.
This method of assessing agreement, if it were adopted as a standard, would have two main benefits. First of all, it would provide guidance to researchers as to the number of judges that they need for acceptable levels of confidence in their classifications. Secondly, it would provide a standardized tool for researchers using judgments, which would allow better comparability between the judgments gathered by different studies. However, it should be noted that this method tends towards the conservative; an observed agreement of 50% will never yield an acceptable result, and minimum proportions of 5/5, 8/10 and 14/20 are required to accept agreements at the 0.05 significance level.
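The lower bound described in this section can be sketched in a few lines of Python using only the standard library. This is an illustrative reconstruction, not the authors' own computation: the function names are hypothetical, and the sketch uses the standard Wilson score formula without continuity correction. The table-style values of ~41% and ~51% are reproduced when the column threshold (70% or 80%) is entered as the proportion rather than the observed 75% or 88%, which appears to be how the table is read in the worked example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(p_hat: float, n: int, alpha: float = 0.05) -> float:
    """One-tailed lower bound of the Wilson score interval for a proportion.

    p_hat: proportion of judges agreeing; n: number of judges.
    """
    z = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645
    denom = 1 + z**2 / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom


def min_agreeing_judges(n: int, cutoff: float = 0.5, alpha: float = 0.05):
    """Smallest number of agreeing judges (out of n) whose lower bound reaches the cutoff.

    Returns None if even unanimous agreement is not enough.
    """
    for k in range(n + 1):
        if wilson_lower_bound(k / n, n, alpha) >= cutoff:
            return k
    return None


# Table-style lookups for eight judges at the 70%+ and 80%+ agreement levels.
print(round(wilson_lower_bound(0.70, 8), 2))
print(round(wilson_lower_bound(0.80, 8), 2))

# Minimum agreement needed for panels of 5, 10 and 20 judges.
print([min_agreeing_judges(n) for n in (5, 10, 20)])
```

Under these assumptions the search returns 5, 8 and 14 agreeing judges for panels of 5, 10 and 20, matching the minimum proportions stated above. Entering the observed proportion 6/8 directly gives a somewhat higher bound than the 70%+ column value, but one that still falls below the 50% cut-off.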
Discussion and conclusion
These results have several implications, not only for the study in question by Shiotsu (2010). One implication is a reinforced caution in conducting content analyses and content validation through expert judgments only. The second inference to be drawn from the five replication studies is that a much clearer definition is needed of what construct is being tested for both syntactic knowledge and lexico-semantic knowledge, at the very least for diagnostic purposes.
The first implication is not a novel finding. However, it is a finding that is still highly relevant and one that appears to need replicating, given the fact that expert judgments are still used frequently, if not exclusively, to validate the content or construct of tests. If a sufficient number of studies have already made the problematic nature of ‘experts’ and their judgments in language testing clear, the field needs to ask itself why it is that ‘expert’ judgments are still often solely relied upon in these matters. The question becomes even more pertinent when considering that the literature has, over the past decades, suggested several alternative (statistical) methods of content validation (Buck & Tatsuoka, 1998; Lee & Sawaki, 2009, who propose using Q-matrices of ‘expert’ judgments against item attributes). But a Q-matrix is, after all, nothing more than a collection of human judgments, and if the human judgments comprising the Q-matrix are incorrect, the resulting diagnostic classifications will also be incorrect. However, it is rare for other empirical but non-statistical approaches to be employed instead of ‘expert’ judgments. The fact that the use of ‘expert’ judgments continues to be a widespread and recommended procedure (e.g. in large-scale assessments such as OECD PISA in 2000, 2003, 2006, 2009, as well as 2012) indicates that the findings of the present studies are still worth discussing in order to further raise awareness.
In addition, the qualifications, degree of expertise and reliability of judges in content validation studies need to be problematized as well. It is not at all clear exactly what criteria should be used to qualify judges as ‘experts’, and the authors are not aware of any studies, including the reference study, that would provide supporting evidence for such criteria or qualificatory credentials.
The fact that all ‘experts’ employed in the replication studies held a degree in linguistics or applied linguistics should ensure that the judgment categories were familiar and that “their qualification for the task should not be in doubt” (Weir, Hughes & Porter, 1990, p. 507), but the different amounts of experience within the expert groups should be acknowledged as a limitation of all such studies to date.
The second, arguably more important and more interesting insight concerns the construct of ‘grammar’ and, potentially, also ‘L2 reading ability’, since one would ideally require a clear distinction to be made between vocabulary and structural knowledge, particularly for diagnostic assessment. Such a clear distinction between syntactic and lexico-semantic knowledge, however, might be neither achievable nor desirable since several linguists have argued for abandoning the vocabulary–grammar dichotomy. Although lexis and grammar have traditionally been kept apart, evidence from corpus linguistics suggests that vocabulary and grammar, because of the highly patterned structure of language, “are in fact inseparable” (Römer, 2009, p. 141). Römer argues that the traditional grammar–lexicon dichotomy “may hold true for sentences which have been invented in order to illustrate it, but it collapses when we consult real language data” (2009, p. 142). Lewis (1993) claims that “the grammar/vocabulary dichotomy is invalid” (p. vi) and argues that “language consists of grammaticalised lexis” (p. vi). Lewis (1993) further maintains that “dichotomies simplify, but at the expense of suppression” (p. 37) and suggests placing “lexical items” (p. 89), that is, words, multi-word units, polywords (e.g. phrasal verbs) or collocations, on a cline or scale instead. Nattinger and DeCarrico (1992) claim that “lexical phrases [are] form/function composites, lexico-grammatical units that occupy a position somewhere between the traditional poles of lexicon and syntax” (p. 36). Sinclair (2004) asserts that “so strong are the co-occurrence tendencies of words, word classes, meanings and attitudes that we must widen our horizons and expect the units of meaning to be much more extensive and varied than is seen in a single word” (p. 39), suggesting that traditional tests of vocabulary employed to investigate the contribution of vocabulary knowledge to reading ability only paint half the picture and that “lexicogrammar” (Sinclair, 2004, p. 39) should perhaps instead be treated as a unitary component of reading ability rather than attempting to distinguish between vocabulary and grammar.
This concern about the real divisibility of the two components has already been raised by Shiotsu and Weir (2007) and Brunfaut (2008) in comments on the original study in question. Findings from other studies investigating the relative contribution of vocabulary knowledge and grammar knowledge appear to confirm that the relation between syntax and lexis is a continuum, as researchers have consistently found high correlations between the two components (Brisbois, 1995; Shiotsu & Weir, 2007; Brunfaut, 2008). It might therefore be of interest for future research to construct tests of lexis in a more phraseological approach and to examine tests of formulaic sequences or multi-word units to see whether they would account for the same amount of variance in reading test performances as traditional vocabulary and grammar measures taken together. In any case, testers and applied linguists need to recognize the slipperiness of the slope between the constructs and need to qualify or describe their dichotomies. Using Likert scales symbolizing this continuum as operationalized in replication studies 4 and 5 instead of categorical classifications might be a first step towards this but further replication studies and increased problematization of judgments employed in research are needed.
Most importantly, future research needs to define its constructs better, needs to avoid simplistic statements to the effect that Grammar is more important than Vocabulary, but rather should make more nuanced and properly researched statements about which aspects of which constructs seem more or less relevant to predicting reading ability in a second language.
Footnotes
Appendix A: Problematic test items (Shiotsu, 2010)
Appendix B: Test items judged to test syntactic knowledge (Shiotsu, 2010)
Acknowledgements
We wish to thank Ari Huhta and Tineke Brunfaut as well as the judges who took part in the various replication studies, and the anonymous reviewers for their valuable feedback on earlier versions of this paper. Part of this paper is based on a Master’s dissertation submitted to Lancaster University, UK in December 2012.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
