Abstract
This study compared three common vocabulary test formats, the Yes/No test, the Vocabulary Knowledge Scale (VKS), and the Vocabulary Levels Test (VLT), as measures of vocabulary difficulty. Vocabulary difficulty was defined as the item difficulty estimated through Item Response Theory (IRT) analysis. Three tests were given to 165 Japanese students, resulting in five measures of vocabulary knowledge and four measures of word difficulty. Analyses included item and score factor analysis, unidimensionality, local independence, and correlations. Results indicate that these are reliable tests. Tests of unidimensionality suggest these tests are essentially measuring one major latent trait, which can be interpreted as a factor for word knowledge. Strong correlations of the scores with each other provide evidence of concurrent validity, and for the interpretation of the scores as indicative of word knowledge. Correlations with other methods of estimating word difficulty, such as transformed frequency, length of word, or number of syllables, suggest that of these methods, the log of frequencies from very large corpora gives the best estimate of word difficulty. However, direct testing of vocabulary difficulty appears to, in the words of Kreuz (1987), “provide a better account of recognition latencies than do counts based on printed word frequency” (p. 159).
Word difficulty is frequently encountered in a variety of studies, ranging from the acquisition of vocabulary, to the effect of vocabulary size on comprehension, and even in the calculation of various readability formulae. However, in most of this research, word difficulty is not explicitly studied. In studies that are interested in the ability of the participants, that is, the participants’ vocabulary knowledge, the researchers treat item difficulty as a nuisance variable and go to some lengths to minimize the effect through the use of classical and Item Response Theory (IRT) item analysis. In this study, however, word difficulty is the subject of interest. Word difficulty, as defined here, does not include the intrinsic difficulty of the word; that is, the amount of effort needed to learn a word (Higa, 1965; Laufer, 1990, 1991, 1997). Word difficulty is defined in this study by the statistical analysis of items on vocabulary tests, and is measured using IRT estimation procedures.
The study of vocabulary difficulty has a long history in native language studies. Dolch (1932) clearly stated this objective with, “the basic fact to be remembered in testing for word difficulty is that we are testing the words and not the children” (p. 22). Tharp (1940) developed an index of difficulty, which consisted of dividing a frequency index by the number of times a word occurred in a given text. He also reported on a word familiarity study of 8000 words, although the method was not discussed. In 1949, Kirkpatrick and Cureton looked at item difficulty from a vocabulary test, correlated it with frequency counts, and found correlations of .47 and .56. They also found that syllable counts correlated at only .20 with item difficulties, commenting, “it seems reasonable to suggest, on the basis of this latter finding, that the authors of ‘readability’ formulas investigate the relative merits of counting syllables as against having a single judge estimate word-difficulties” (Kirkpatrick & Cureton, 1949, p. 349). Kreuz (1987) compared word familiarity measured with a 7-point Likert scale, a lexical decision task, and printed frequencies and found that “subjective familiarity ratings provide a better account of recognition latencies than do counts based on printed word frequency” (p. 159).
In this paper, I will begin by briefly discussing how Item Response Theory allows an alternative approach to word difficulty and then examine three vocabulary test formats to determine their suitability for IRT analysis and the study of word difficulty.
Item Response Theory and word difficulty
Word difficulty, defined in this study by item difficulty, and seen here as a measure of the participants’ familiarity with the word, can be measured through the use of Item Response Theory. Item Response Theory attempts to explain the response of a person to an item with a probabilistic model. In its simplest form, Item Response Theory posits that the probability of a random person with ability θ answering a random item with difficulty b correctly is conditioned upon the ability of the person and the difficulty of the item. In other words, if a person has a high ability in a particular field, he or she will probably get an easy item correct. Conversely, if a person has a low ability and the item is difficult, he or she will probably get the item wrong. For example, we would expect someone with a large vocabulary to say they know words like age and beautiful but we could not expect someone with a small vocabulary to know words like subsidy or dissipate.
One of the fundamental properties of IRT is the invariance of the parameters. What this means is that any test can determine the ability of the student regardless of the difficulty of the items on the test, provided that the item parameters are known. Both ability and difficulty are expressed in logits, the log of the odds unit. When the items are of the same difficulty as the person’s ability, the student will get 50% correct and 50% wrong. In other words, if a student took a test composed of items with difficulties of 1 logit, and she got 50% correct, we could assume she has an ability of 1. If she scored 88% correct on a test made of items with difficulties of −1, we would also assume her ability to be 1, as we would if she scored 27% on a test with items of 2 logit difficulties. A score of 27%, 50%, or 88% could result in the same ability estimate depending on the difficulties of the items. This same property is mirrored for the items. Any test can determine the difficulty of the items as long as the students’ abilities are known. Therefore, it is possible to measure the difficulty of thousands of words by using staggered multiple forms of a vocabulary test.
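The percentages in this example follow from the one-parameter (Rasch) logistic function. The short sketch below reproduces them; the function name is hypothetical and the code is only an illustration of the model, not the estimation software used in this study.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the one-parameter model.

    theta is the person's ability and b is the item difficulty, both in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# One student with an ability of 1 logit, facing items of three difficulties.
ability = 1.0
for difficulty in (1.0, -1.0, 2.0):
    p = rasch_probability(ability, difficulty)
    print(f"item difficulty {difficulty:+.0f} logit: expected {p:.0%} correct")
```

The same ability of 1 logit yields expected scores of roughly 50%, 88%, and 27% as the item difficulty moves from 1 to −1 to 2 logits, which is why a raw score alone cannot be interpreted without the item difficulties.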
The benefits of having difficulty estimates for thousands of words are threefold. First, they can be used to assess vocabulary size. Once we have the ability of the student, we can calculate the probability of that student knowing each and every word in our sample. Then, it is a simple matter of adding the probabilities to get the lexical size. Second, difficulty estimates can be used to help sequence a pedagogical program of vocabulary development. Through testing a sample of the words, the student’s ability can be determined. Once the student’s ability is known, the probability of each word being known can be calculated. With this information, we can determine what words should be excluded because they have a high probability of being known, and we can identify any high-frequency words that are probably unknown, thus reducing the risk of these important words being missed. This approach can be used with the individual student or with a class average. Finally, word difficulty estimates can be used for grading the lexical burden in learning material. With the reader’s ability estimate, and the difficulty of the words in the passage, it will be possible to calculate the overall lexical difficulty of the text for the reader, and to rank texts according to the lexical burden they will have on the reader. It may also be possible to identify those words most likely to cause problems with comprehension, so that those words can be pre-taught. These three benefits are just a few of the reasons why a difficulty index for most of the high-frequency words in a language is a sound pedagogical endeavor.
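The first two uses can be sketched in a few lines. This is a minimal illustration under the one-parameter model; the word difficulties below are invented for the example and are not estimates from this study, and the function names and the 0.5 cutoff are assumptions.

```python
import math


def rasch_probability(theta, b):
    """Probability that a person of ability theta knows a word of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_known_words(theta, difficulties):
    """Estimate lexical size by summing the probability of knowing each word."""
    return sum(rasch_probability(theta, b) for b in difficulties.values())


def likely_unknown(theta, difficulties, threshold=0.5):
    """List words whose probability of being known falls below the threshold."""
    return sorted(
        word
        for word, b in difficulties.items()
        if rasch_probability(theta, b) < threshold
    )


# Hypothetical difficulties in logits, invented for illustration only.
word_difficulties = {"age": -2.0, "beautiful": -1.5, "subsidy": 1.5, "dissipate": 2.0}

student_ability = 0.0
print(expected_known_words(student_ability, word_difficulties))
print(likely_unknown(student_ability, word_difficulties))
```

With a real difficulty index, the dictionary would hold thousands of words, and the list of likely unknown words could be filtered further by frequency to find important words worth teaching first.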
Research questions
How reliable and valid are the three tests of vocabulary knowledge for measuring vocabulary ability?
How reliable and valid are the three tests of vocabulary knowledge for measuring vocabulary difficulty?
Do the three tests meet the criteria of unidimensionality and local independence necessary for analysis using Item Response Theory?
What is the relationship between word familiarity, as measured by the three tests, and other measures used to estimate word familiarity such as frequency and structural complexity?
Method
Participants
The tests were administered to 78 students at a Japanese university (26 females, 52 males) and 87 students attending a two-year women’s junior college (all female) for a total of 165 participants in this study. All students were in their first year. They ranged in age from 18 to 20 years. The students were told they were taking part in a study to measure vocabulary size and word difficulty.
Materials
The materials for this study consisted of three tests of the University Word List (Nation, 1990; Xue & Nation, 1984), taken from Beglar and Hunt’s (1999) two revised forms of the University Word Vocabulary Level test. The University Word List (UWL) is a list of 807 word families drawn from written academic texts based on frequency and range of disciplines (see Nation, 1990; Xue & Nation, 1984). On each form of the test, there are 54 words matched to 27 definitions. One distractor, fraction, was used twice on the Level tests and was replaced by friction, drawn from the same frequency group of the UWL (Nation, 1990) yielding 108 words.
Yes/No test
On the Yes/No test (Y/N), the participants read the words on a word list and responded by filling in a bubble font version of a Y or an N on separate mark sheets. Nonwords were included to adjust for guessing. On this test, all 108 words were included plus 52 nonwords for a total of 160 items. The canonical forms of the nonwords were selected from a list generated by the ARC nonword database (Rastle, Harrington, & Coltheart, 2002).
One major advantage of the Y/N is that the participants can respond to a large number of items in a relatively short time. As a result, a larger sample of the language can be tested. A second advantage is its ease of production (Meara & Buxton, 1987). Item writers do not have to be trained to produce a satisfactory test. A third advantage is that the item response is relatively straightforward and does not require the test-taker to perform any additional tasks (Anderson & Freebody, 1983). The primary disadvantage of the Y/N is with the dichotomizing of the concept “knowing a word.” For one participant, being familiar with the word may be enough, while another participant may respond based on the knowledge of the meaning (Anderson & Freebody, 1983). This one characteristic makes the Y/N unsuitable for pretest and posttest design (Paul, Stallman, & O’Rourke, 1990). Another disadvantage is the inability to test for the various meanings of polysemous words (Anderson & Freebody, 1983). A third disadvantage is with the construction of the nonwords and their similarity to real words (Cameron, 2002; Eyckmans, 2004). A fourth disadvantage, which is shared by all the tests in this study, is that the words are not presented within context (Read, 1997).
Scoring the Y/N is usually done using the correction-for-guessing formula (Equation 1) introduced by Anderson and Freebody (1983). The proportion of known words was derived from the following equation:
P(K) = [P(H) − P(FA)] / [1 − P(FA)]     (1)
where P(K) is the proportion of words known, P(H) is the proportion of Hits, or real words marked as known, and P(FA) is the proportion of False Alarms or nonwords marked as known. Scores are typically reported as the proportion of known words, that is, the P(K) (Anderson & Freebody, 1983), as the percentage correct, that is, the P(K) times 100 (Paul et al., 1990), or as the estimate of the number of target words known (Meara & Buxton, 1987). Criticisms of this testing method have centered on its assumption of decisions made regarding False Alarms (Macmillan & Creelman, 2005) and its ability, or lack of, to account for examinee bias toward Yes answers (Beeckmans, Eyckmans, Janssens, Dufranne, & Van de Velde, 2001; Eyckmans, 2004; Huibregtse, Admiraal, & Meara, 2002). There also appears to be an L1 influence on the False Alarms rates (Eyckmans, 2004).
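As an illustration, the correction-for-guessing formula, as it is usually stated, is P(K) = [P(H) − P(FA)] / [1 − P(FA)]. The sketch below applies it to one participant's counts; the function name is hypothetical and the counts in the usage line are invented, although the totals of 108 real words and 52 nonwords match the test described here.

```python
def proportion_known(hits, n_words, false_alarms, n_nonwords):
    """Correction for guessing: P(K) = (P(H) - P(FA)) / (1 - P(FA)).

    hits          real words marked as known
    n_words       number of real words on the test
    false_alarms  nonwords marked as known
    n_nonwords    number of nonwords on the test
    """
    p_h = hits / n_words
    p_fa = false_alarms / n_nonwords
    if p_fa >= 1.0:
        # Every nonword was marked "Yes", so the correction is undefined.
        raise ValueError("P(FA) must be less than 1")
    return (p_h - p_fa) / (1.0 - p_fa)


# A hypothetical participant: 60 Hits on 108 words, 6 False Alarms on 52 nonwords.
p_k = proportion_known(hits=60, n_words=108, false_alarms=6, n_nonwords=52)
print(f"P(K) = {p_k:.2f}, or about {p_k * 108:.0f} of 108 words")
```

Note that a participant with no False Alarms keeps the raw Hit proportion, while each False Alarm lowers the estimate, which is the behavior the criticisms cited above are concerned with.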
Vocabulary Knowledge Scale
In order to avoid the dichotomization of word knowledge, the use of scales has been suggested (Stoller & Grabe, 1993). The Vocabulary Knowledge Scale (VKS) employs two such scales, a self-report, and a scored assessment (Paribakht & Wesche, 1997; Wesche & Paribakht, 1996). The participant is required to report how well he or she knows the word based on self-report categories. If they report knowing the word, they must then demonstrate their knowledge with a synonym or translation, and, in the highest category, with a sentence. The words on this test were taken from Form A of the Level test, including the 27 tested words, 27 distractors (including friction; see above) plus 26 nonwords taken from the Y/N for a total of 80 items. The distractors were included so as not to prompt the subsequent Level test. The nonwords were included as part of an investigation on the measurement of the guessing parameter on the Y/N.
Paribakht and Wesche suggested two ways to score the VKS. First, a self-report score can be derived from the sum of a one-to-one transformation of the self-report categories. Second, the synonyms, translations, and/or sentences can be hand-scored (see Paribakht & Wesche, 1997, p. 181) and totaled for each participant.
The principal advantage of the VKS is that it bridges the gap between self-report and demonstrated knowledge. Like interviews, the VKS is sensitive to partial word knowledge and individual differences (Paul et al., 1990). Unlike interviews, the raters do not have to be present during the data collection period. However, this eliminates the chance for raters to prompt the students (Anderson & Freebody, 1983). The disadvantages of the VKS center on the practicality of the tests, and the knowledge burden on the examinees. The first disadvantage of the VKS is in the method of scoring the test. Because the participants can respond with a translation or a synonym, scoring the VKS necessitates bilingual raters, and by extension, the study of interrater reliability (Paul et al., 1990). A second disadvantage of the VKS for data collection is that there is a considerable variance in the quantity and quality of output required for the different response options. With the Y/N and the VLT, the same amount of output is required whether the word was known or not. With the VKS, however, participants who indicate that they do not know the word need only fill in the appropriate bubble, while those who report having the highest level of knowledge must fill in the bubble, write a translation or synonym, and a sentence. For those participants with a high level of knowledge, the test is considerably longer and requires more output than for those with a low level of knowledge. Given that the data in this study were collected for research purposes and not for grades, this knowledge burden may have exerted an influence over the response. In terms of taking the test and scoring the results, this test was by far the most time consuming and least economical.
Criticism of the VKS centers on the interpretation of the scores. While the self-report score can be considered as solely a measure of receptive knowledge, the scored version of the VKS may be criticized for including measures of both receptive and productive knowledge, that is, while the use of the word itself tests receptive knowledge, knowledge of how to use it in a sentence requires some productive knowledge. Stewart, Batty, and Bovee (2012) found the VKS to be weakly multidimensional. Read (1997) questioned whether the categories represent stages of acquisition or are of equal intervals. He also questioned the appropriateness of the task, a criticism that can be leveled at most, if not all, vocabulary tests.
Vocabulary Levels Test
The Vocabulary Levels Test (VLT) was introduced by Nation (1983, 1990) with substantial revision made by Beglar and Hunt (1999) and Schmitt, Schmitt, and Clapham (2001). The VLT is essentially a reformatting of the synonym or definition to word-matching tasks often found in standard multiple-choice questions. Instead of one synonym or definition prompt with four options, the VLT groups three definitions together with six choices. Participants indicate knowledge by matching the definitions with the target words. Participants took both forms of the revised VLT with 27 tested items per form for a total of 54 words. Scoring for this test is relatively straightforward with the reported score representing the sum of the correct answers.
The advantages of the VLT are that the students can answer many questions in a relatively short period of time, the answers are amenable to using mark sheet readers, and, as with most multiple-choice questions, high reliability can be achieved. The disadvantages are also those shared with standard one-to-four multiple choice questions in that the distractors can have a considerable influence on the outcome of the test (Anderson & Freebody, 1983; Paul et al., 1990). Another disadvantage from a statistical point of view is that the items may lack local independence; that is, because the distractors are recycled, one item can have an influence on the other items. Nation has suggested that the 18 items on the tests are actually measuring 36 words (personal communication), and there is evidence that he is correct. However, this would also be evidence of a violation of local independence and undermine the suitability of an IRT analysis of the VLT.
Procedure
All tests were administered to the participants during one 85-minute session. To control for learning during the test, the participants took the Y/N first, followed by the VKS and finally the VLT. The Y/N and VLT mark sheets were scanned by computer. The self-report responses on the VKS were input by hand onto a spreadsheet, and then the tests were sent to two raters for scoring, the first a native speaker of English with high Japanese proficiency, and the second a native speaker of Japanese with high English proficiency. Based on a trial test, training sets and a training manual were developed to facilitate scoring and increase inter-rater reliability.
The tests were scored using the accepted practices for the type of test. The results from the Y/N were calculated using the correction-for-guessing formula (Equation 1). The VKS contributed three scores, VKS, Rater 1, and Rater 2. The VKS was calculated directly from the participants’ self-reports without regard to the correctness of their responses. Rater 1 and Rater 2 scores were derived from the total score assessed by the raters. Only the results from the 54 real words on the VKS are reported here. The VLT scores were the sum of the number of correct responses.
Results
Descriptive statistics
Initial data analysis consisted of examining the data from the 165 participants who took the three tests for accuracy of data entry and fit between the data distributions and the assumptions of multivariate analysis. These descriptive statistics are presented in Table 1. For the Y/N statistics, the estimated proportion of known words multiplied by the number of real words was used. Therefore, the maximum possible score was 108 items correct. The mean of the Y/N reported below was the average P(K) (0.53) multiplied by 108. The mean for the number of Hits was 61.81, or a 0.57 proportional score, and 5.61 (0.11) for the False Alarms. For the VKS, Rater 1, and Rater 2, with each item scored on a five-point scale, the maximum scores were 270 points, and the proportional scores for the means were 0.57, 0.48, and 0.48, respectively. For the VLT, the maximum score was 54 points, and the proportional score for the mean was .60.
Table 1. Descriptive statistics for three tests of vocabulary knowledge.
The reliability coefficients reported here are all Cronbach’s α except for the VLT, which used K-R 20. Cronbach’s α was calculated with split-halves, using odd and even item numbers. K-R 20 reliability coefficients are sometimes reported for the Y/N, using responses on the real words for the item variance (Shillaw, 1996). If the test and item variances are measured using only the responses to the real words, the K-R 20 for the test is .95, while if the test variance is derived from the estimated number of known words and the sum of the item variances includes item variance from the nonwords, the K-R 20 is .94. Raters 1 and 2 are reported here as separate measures. The correlation coefficient between the two raters, a measure of interrater reliability, was .95, showing high agreement between the two raters. When responses in the first two categories are excluded from the calculation, the interrater reliability coefficient becomes .98. Given the high correlation of the scores, and that under normal circumstances a second rater would only grade a percentage of the papers, I chose to use the results obtained from Rater 1 for subsequent analysis. The reliabilities reported here are in line with other published results when adjusted for the number of items on the tests.
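For readers unfamiliar with K-R 20, the coefficient for dichotomous items is k/(k − 1) × (1 − Σpq/σ²), where k is the number of items, p and q are the proportions answering each item correctly and incorrectly, and σ² is the variance of the total scores. The sketch below is a minimal illustration with an invented response matrix; the use of the population variance is an assumption, and this is not the software used to obtain the coefficients reported here.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores.

    responses is a list with one inner list of item scores per person.
    The population variance of the total scores is used (an assumption).
    """
    n_people = len(responses)
    k = len(responses[0])

    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have no variance")

    # Sum of item variances, p * q, over all items.
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n_people
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Invented responses from four people to three items, for illustration only.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(toy_responses))
```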
The data were examined for normality of distribution. In the histogram, the Y/N appeared to have positive skewness, but this was not significant (z = 0.219) at a conventional but conservative alpha of .01 for a two-tailed test (Tabachnick & Fidell, 1996). When all tests were examined, neither skewness nor kurtosis was significant at the .01 level.
Dimensionality
The analysis of item responses under the rubric of IRT assumes two related characteristics of the data. One assumption of many IRT models, including the one, two, and three parameter logistic models, is that the item responses are unidimensional, which is to say that the items are measuring one cognitive trait. Another assumption is that the items are locally independent which means that once the unidimensional trait is accounted for in the data, all subsequent variations in the item responses must be independent of each other (Hambleton, Swaminathan, & Rogers, 1991). Central to this concept is that the responses from one item do not have an effect on other items. Ackerman (1987) found that strong dependency between items leads to item difficulty estimates correlating negatively with simulated data parameters.
Violations of this premise can be demonstrated in the responses on the VLT. In the item cluster shown in Figure 1, the item facilities (IFs) for the VLT are listed in parentheses after the definitions in the left column, while the item facilities for the Y/N are listed after the word choices in the right column. Students identified the distractors focus, volume, and mathematics as being well known, suggesting that they could be eliminated. The target word, section, was also identified as being well known on the Y/N but less so on the VLT. An analysis of the distractors shows 9.1% of the respondents choosing volume, possibly drawn to the collocation larger. Once these four well-known options are eliminated, guessing the answers for the two less known words, intimacy and doctrine, effectively becomes a 50% chance of guessing correctly, as indicated by the differences in item facilities. This indicates that the probability of a correct response depends not only on the trait of vocabulary knowledge but also on the difficulty of the other words in the cluster. Thus, it is possible that while the items may have the property of unidimensionality, the influence of the options on individual items affects their outcome and the items lack local independence.

Figure 1. VLT item cluster with item facilities from the Yes/No test and the Vocabulary Levels Test.
While the distractors have an effect on the item, it does not appear to be a straightforward situation of guessing, as we would expect both words to benefit equally. The large increase in the item facility for intimacy suggests something else is at work such as the activation of some residual memory by the definition. It is interesting to note that if the IFs are taken as a rough indication of the probability of a student knowing a word, their sum can be used to find out how many words they would know in the cluster. Of the three words tested by the definitions, the sum of the IFs comes to 1.87 words or 62% of the cluster. Applying the same logic to the six options, a sum of 4.34 words or 72% is obtained. This suggests that the cluster may be working well as a unit to measure lexical knowledge.
Item factor analysis
One of the first, and most common, methods to investigate dimensionality is the factor analysis of the item correlation coefficients. The scree plots generated by MicroFACT (Waller, 1995) of the initial correlation matrix using a principal components analysis were examined for the presence of an elbow and all tests seem to have one major factor that accounted for most of the variation, with one or two minor factors. The prominence of a major factor can also be seen in Table 2. The first three rows show the eigenvalues of the three largest extracted factors from each test, with the first eigenvalue representing the proportion of variance accounted for by the first factor. The fourth row shows the sum of the eigenvalues that represents the proportions of all variance explained by the factors and is equal to the sum of the communalities. The fifth row shows the percentage of total variance that is accounted for by the first factor.
Table 2. Eigenvalues of principal axes factor analysis.
The first eigenvalue, taken from the principal axes factor analysis, fits Lord’s (1980, p. 19) criteria in that it is much bigger than the second eigenvalue, and the second is not much larger than the others. Using Reckase’s (1979) criterion of unidimensionality, the first factor should account for more than 20% of the variance, which with a principal axes factor analysis is equivalent to the first eigenvalue divided by the sum of the communalities. As can be seen from Table 2, all tests meet Reckase’s criterion.
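Reckase’s criterion reduces to a simple ratio once the eigenvalues are available. The sketch below shows the check; the eigenvalues are invented for illustration and are not the values in Table 2, and the function names are hypothetical. It assumes that the list holds all extracted eigenvalues, so that their sum stands in for the sum of the communalities.

```python
def first_factor_share(eigenvalues):
    """Share of explained variance carried by the largest eigenvalue."""
    return max(eigenvalues) / sum(eigenvalues)


def meets_reckase(eigenvalues, minimum_share=0.20):
    """True when the first factor accounts for more than 20% of the variance."""
    return first_factor_share(eigenvalues) > minimum_share


# Invented eigenvalues for illustration only (not taken from Table 2).
example_eigenvalues = [12.0, 2.0, 1.5, 1.0, 0.5]
print(f"{first_factor_share(example_eigenvalues):.1%}", meets_reckase(example_eigenvalues))
```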
The Rasch Model offers an alternative way to investigate dimensionality through the principal components analysis of residuals (PCAR). Residuals are the differences between the actual observed scores and the expected value of the scores based on the model. The difficulty dimension accounted for 92.7% of the variance of the Y/N while only 0.4% of the unexplained variance was accounted for by the largest secondary dimension, the first factor in the residuals. For the VKS, Rater 1, and the VLT respectively, the variance explained by the difficulty dimension amounted to 80.5%, 66.5%, and 50.8%. The unexplained variance amounted to 1.9%, 3.6%, and 2.4%, respectively. The results of the principal components analysis of the residuals very strongly suggested that no additional structures were present in the Y/N. This provides evidence that the Y/N is the most unidimensional while the VLT is the least. Eigenvalues present another picture, however. For the Y/N, the first factor in the residuals accounted for 5.9 out of a total unexplained variance of 108. For the VKS, Rater 1, and the VLT respectively, the eigenvalues were 5.3, 5.8, and 2.6, all out of a total unexplained variance of 54. Since Linacre (2005) suggested that “if the first factor has ‘units’ (i.e., eigenvalue) less than 3 (for a reasonable length test) then the test is probably unidimensional” (p. 294), only the VLT meets that criterion. Under the Rasch paradigm, the objective is to identify those items that do not fit the model for subsequent deletion or incorporation into a separate metric. It is possible that using a less restrictive model would reduce these factors in the residual matrix.
DIMTEST analysis
Another test for unidimensionality was developed by Stout and associates (Nandakumar & Stout, 1993; Nandakumar, Yu, Li, & Stout, 1998; Stout, 1987, 1990), based on Stout’s definition of essential dimensionality and essential independence (Stout, 1987, 1990). They developed two programs, DIMTEST (1993) for binary data (Nandakumar & Stout, 1993; Stout, 1987) and Poly-DIMTEST for polytomous response data (Li & Stout, 1995). The null hypothesis of essential unidimensionality is rejected if Stout’s T exceeds the upper 100(1 − α) percentile (Stout, Douglas, Junker, & Roussos, 1993), that is, when the test is one-tailed. Setting the α at .05, the 100(1 − .05), or 95th, percentile on a normal distribution is 1.65. T statistics that are less than this are seen as evidence of unidimensionality. Because of this directionality, T statistics of large negative numbers are also interpreted as evidence for unidimensionality. The null hypothesis of unidimensionality was not rejected for any of the tests, with the following statistics: Y/N (T = 0.337, p > .05), the VKS (T = −1.467, p > .05), Rater 1 (T = 0.882, p > .05) and the VLT (T = 0.119, p > .05).
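The decision rule can be made concrete with a short sketch. It takes the T statistics reported above as given and compares each with the one-tailed critical value from the standard normal distribution; it does not compute Stout’s T itself, and the function name is hypothetical.

```python
from statistics import NormalDist


def rejects_unidimensionality(t_statistic, alpha=0.05):
    """One-tailed test: reject only when T exceeds the upper critical value."""
    critical_value = NormalDist().inv_cdf(1 - alpha)
    return t_statistic > critical_value


# T statistics as reported in the text.
t_statistics = {"Y/N": 0.337, "VKS": -1.467, "Rater 1": 0.882, "VLT": 0.119}

critical = NormalDist().inv_cdf(0.95)
print(f"critical value at alpha = .05: {critical:.3f}")
for test_name, t in t_statistics.items():
    one_tailed_p = 1 - NormalDist().cdf(t)
    decision = "rejected" if rejects_unidimensionality(t) else "not rejected"
    print(f"{test_name}: T = {t:+.3f}, one-tailed p = {one_tailed_p:.3f}, null {decision}")
```

Because the test is one-tailed, a large negative T such as the one for the VKS yields a p value close to 1 and therefore counts as evidence for, not against, unidimensionality.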
Correlation matrix of test scores
Table 3 shows a correlation matrix of the scores from each of the tests. The correlation indices above the diagonal of unities (1.00s) and the coefficients of determination (R²) below are derived from the total scores on each test; that is, the Y/N results come from all 160 items corrected for guessing using Equation 1, the VKS from 54, and the VLT from 54 items. Given the history and theoretical background of these vocabulary tests, positive correlations are expected, and so a one-tailed test of significance was used. All correlations are significant at p < .01. The indices reveal that the highest correlations are between the VKS and those measures that are derived from it. Those correlation indices are reported here only in the interest of comparing them to the other measures, as they would pose a problem with singularity if used for any other purpose. The Y/N has its highest correlation with the VKS at .79, while the lowest correlation is between the Y/N and the VLT at .63. The correlation indices reported here can be compared with Mochida and Harrington’s study, which took a similar approach. They found higher correlations between the VLT and the Y/Ns, reporting correlation indices ranging from .80 to .87 using the correction-for-guessing formula.
Table 3. Raw score correlation matrix above the diagonal and coefficients of determination below.
Note: N = 165. All correlations are significant at p < .01 level (one-tailed).
Based on the evidence presented above, the adoption of these formats can lead to reliable and valid measures of the individual’s vocabulary knowledge. High internal consistency coefficients indicate that these can be reliable tests. The high loadings on one factor and the tests of unidimensionality suggest these tests are essentially measuring one major latent trait, which can be interpreted as a factor for word knowledge. The strong correlations of the scores with each other provide evidence of the concurrent validity of the measures and support the interpretation of the scores as indicative of word knowledge.
Factor analysis of ability estimates
A principal axis factor analysis was performed on the Y/N, Rater 1, and VLT. The first eigenvalue, at 2.38, accounted for 79.20% of the variance, the second for 12.63%, and the third for 8.16%, supporting the unidimensional nature of the data. The loadings on the extracted factor were 0.80 for the Y/N, 0.91 for Rater 1, and 0.78 for the VLT.
Different measures of word difficulty
Having looked at these three tests in the conventional manner from the point of view of the examinee, I will now compare the item difficulty estimated from item responses to other methods of estimating word difficulty, including orthographic features and frequency counts. As stated earlier, readability formulae use orthographic features such as the number of letters or syllables to denote lexical difficulty. The Miyazaki EFL readability index (Greenfield, 2004) includes the letters per word as one variable. The Gunning-Fog Index, the Fog count, and Brown’s (1998) EFL readability index use the number of words seven or more letters long. The Flesch Reading Ease, Flesch-Kincaid, and Fry use syllables/words as their measure of lexical complexity, while the Gunning-Fog Index uses the percentage of words over two syllables (see Brown, 1998). Variables drawn from orthographic features include the number of letters (NLET) in the word, the number of phonemes (NPHN), and the number of syllables (NSYL).
Descriptive statistics for these variables are presented in Table 4. Frequency is used as a variable for lexical familiarity by one measure of text difficulty: the Lexile text measure (Stenner & Burdick, 1997) uses the log of the word frequencies taken from the American Heritage Word Frequency Book (Carroll, Davies, & Richman, 1971). Frequency also plays a part in the analysis of text to determine its difficulty, under the assumption that more high-frequency words make for an easier text. Frequency data were obtained from various corpora, including the Kučera and Francis analysis of the Brown corpus (KFFRQ) and the Thorndike and Lorge research (TLFRQ). The first five columns of data, NLET, NPHN, NSYL, KFFRQ, and TLFRQ, were obtained from the MRC psycholinguistic database (Wilson, 1988). Columns 6, 7, and 8 also contain frequency data: column 6 (BNC) is from the British National Corpus (Leech, Rayson, & Wilson, 2001), column 7 (LOB) is from the Lancaster-Oslo/Bergen corpus (Johansson & Hofland, 1989), and column 8 (AHF) is from the American Heritage Word Frequency Book (Carroll et al., 1971). Unlike KFFRQ and TLFRQ, which include data only for the word type, the data in these three columns are based on the lemma. In counts such as the LOB and AHF, word types for the corresponding lexemes were entered into a spreadsheet and totaled. Histograms and the descriptive statistics confirmed the presence of skewness in the NSYL and in all the frequency data. The NSYL data were transformed by taking the common log of the number of syllables. For the raw frequency data, all counts were transformed using Carroll’s Standard Frequency Index (Carroll, 1971), a method that adds a constant to the common log of the frequency before multiplying by another constant. After the log transformations were performed, neither skewness nor kurtosis was significant.
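The Standard Frequency Index transformation can be sketched as follows. This assumes the commonly cited form SFI = 10(log10 U + 4), where U is the frequency per million tokens; the function name is hypothetical, and the raw per-million frequency is used here in place of Carroll’s dispersion-adjusted estimate, which is an assumption rather than a detail reported in this study.

```python
import math


def standard_frequency_index(count: int, corpus_size: int) -> float:
    """Approximate SFI = 10 * (log10(U) + 4), with U = occurrences per million tokens."""
    if count <= 0 or corpus_size <= 0:
        raise ValueError("SFI is undefined for zero or negative counts")
    per_million = count * 1_000_000 / corpus_size
    return 10 * (math.log10(per_million) + 4)


# Hypothetical counts: a word seen once per million tokens, and one seen
# 100,000 times per million tokens (one token in ten).
print(standard_frequency_index(1, 1_000_000))
print(standard_frequency_index(100_000, 1_000_000))
```

Because the index is a linear function of the log frequency, it compresses the long right tail of raw counts, which is what removes the skewness noted above.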
Table 4. Descriptive statistics for word difficulty estimates derived from orthographic features, frequency, and item responses.
The last four columns summarize the difficulty estimates calculated by the Winsteps program (Linacre, 2005). For the Y/N and the VLT, the difficulty parameters are estimated using a one-parameter response model for dichotomous data. The Y/N data consisted solely of the Hit responses, that is, the “Yes” responses to real words, without correcting for the effects of guessing. For the VKS and Rater 1 data, which are polytomous, IRT programs such as Quest (Adams & Khoo, 1996) give an item difficulty for each step. Step difficulties are hard to interpret in terms of word knowledge, and they are problematic given that the objective of this research is to assign a single difficulty to each word. The solution to this problem is to re-score the data to make them dichotomous and to calibrate the item difficulties from the re-scored data.
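The two ideas in this paragraph, the one-parameter model for dichotomous data and the re-scoring of polytomous ratings, can be sketched as follows. This is an illustration only: the function names, the 1 to 5 rating scale, and the cutoff of 3 are assumptions, since the re-scoring rule is not specified here, and the sketch does not reproduce the Winsteps estimation procedure.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """One-parameter (Rasch) model: probability of a correct response, in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def dichotomize(ratings, cutoff):
    """Re-score polytomous ratings as 1 (rating >= cutoff) or 0 (otherwise)."""
    return [1 if r >= cutoff else 0 for r in ratings]


# Hypothetical ratings for one word from six examinees; the cutoff is an assumption.
ratings = [1, 2, 3, 5, 4, 2]
print(dichotomize(ratings, cutoff=3))

# An examinee whose ability equals the word's difficulty has an even chance.
print(rasch_probability(ability=0.0, difficulty=0.0))
```

Under this model each word has a single difficulty parameter, which is what makes the dichotomous re-scoring attractive when the goal is one difficulty value per word.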
In Table 5, the correlation coefficients appear above the diagonal, the number of words in each data set on the diagonal, and the number of words used in each correlation below the diagonal. The number of words ranges from 108 on the Y/N to 27, the number of items shared by the VKS and the VLT. The column headings are as described above.
Table 5. Coefficients of correlation of various word difficulty estimation methods above the diagonal, number of words in the data set along the diagonal, and number of words used in the correlation below.
Note: *Correlations are significant at the p < .05 level (two-tailed).
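Because each pair of measures shares a different set of words, every coefficient in a table of this kind rests on its own sample size. A minimal sketch of such a pairwise computation is given below; the function name and the word values are hypothetical and are not data from this study.

```python
import math


def pairwise_correlation(x: dict, y: dict):
    """Pearson r over the words present in both measures; returns (r, n)."""
    shared = sorted(set(x) & set(y))
    n = len(shared)
    if n < 3:
        raise ValueError("too few shared words to correlate")
    xs = [x[w] for w in shared]
    ys = [y[w] for w in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy), n


# Hypothetical item difficulties (logits) and transformed frequencies.
difficulty = {"house": -1.2, "obtain": 0.3, "meager": 1.1, "abate": 1.9}
frequency = {"house": 68.0, "obtain": 57.0, "meager": 44.0, "the": 88.0}

r, n = pairwise_correlation(difficulty, frequency)
print(round(r, 2), n)
```

Only the three words common to both dictionaries contribute to the coefficient, which mirrors the way the number of words used in each correlation is reported separately from the coefficient itself.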
Factor analysis of difficulty estimates
An examination of the data using SPSS showed satisfactory linearity, with no evidence of curvilinearity or heteroscedasticity in the bivariate scatterplots. All z scores were within ±3.29 (p < .001), and no multivariate outlier with a Mahalanobis distance greater than χ²(3) = 16.266 (p < .001) was found. The Kaiser-Meyer-Olkin measure of sampling adequacy for factorability of the correlation matrix was .67. Given the high correlation between the Y/N and Rater 1, the squared multiple correlations of each variable with the other two variables were examined for multicollinearity. The squared multiple correlations were .86 for the Y/N, .87 for Rater 1, and .45 for the VLT. Although both the Y/N and Rater 1 have variance proportions greater than .50 associated with the same root, the conditioning index is less than 30; thus multicollinearity does not appear to be a problem.
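The two outlier cutoffs can be checked against their stated probability level with a short calculation. The sketch below uses only the standard library; the function names are hypothetical, and the chi-square expression is the closed form that holds specifically for three degrees of freedom.

```python
import math


def two_tailed_p_from_z(z: float) -> float:
    """Two-tailed probability of |Z| >= z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def chi2_upper_tail_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Cutoffs used for univariate (z) and multivariate (Mahalanobis) outliers.
print(round(two_tailed_p_from_z(3.29), 4))
print(round(chi2_upper_tail_df3(16.266), 4))
```

Both values come out at approximately .001, so the univariate and multivariate screens apply the same criterion.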
A principal axis factor analysis was performed on the Y/N, Rater 1, and VLT item difficulty estimates. The first eigenvalue, at 2.50, accounted for 83.22% of the variance, the second for 14.45%, and the third for 2.32%. The loadings on the extracted factor were .93 for the Y/N, .99 for Rater 1, and .68 for the VLT.
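The link between an eigenvalue and its percentage of variance can be made explicit. Assuming the percentages refer to the initial eigenvalues of a three-variable correlation matrix, whose eigenvalues sum to the number of variables, each share is the eigenvalue divided by three. The function name below is hypothetical.

```python
def percent_variance(eigenvalue: float, n_variables: int) -> float:
    """Share of total variance for one eigenvalue of a correlation matrix."""
    return 100 * eigenvalue / n_variables


# First eigenvalues reported for the ability and the difficulty analyses.
for label, eigenvalue in [("ability", 2.38), ("difficulty", 2.50)]:
    print(label, round(percent_variance(eigenvalue, 3), 2))
```

The results differ from the reported 79.20% and 83.22% only in the first decimal place, which is consistent with the eigenvalues having been rounded to two decimals.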
Discussion
The results displayed in Table 5 show that estimates of word difficulty based on orthographic features or frequency are not good predictors of difficulty measured from item responses. While the orthographic features correlated highly with each other, they had only weak to moderate correlations with the other indices, ranging from −.24 to −.41 with the frequency data and from .14 (ns) to .54 with the test items. Frequency data performed marginally better, with correlation coefficients ranging from −.19 (ns) to −.71 with the item responses; the negative relationship shows that as frequency increases, difficulty goes down. The largest correlations between frequency counts and estimates based on item responses were with the BNC, suggesting that in corpus work, bigger is better.
Results from the VKS seem to support the assumption that students can and do misrepresent their knowledge of words. A method to correct for this tendency therefore appears necessary when Y/N tests are administered to estimate the vocabulary ability of students. Anderson and Freebody (1983) proposed an IRT model with two person parameters, the first a measure of overall word knowledge and the second a measure of the student’s judgmental standards. Unfortunately, the intended model was not developed. However, this may not be problematic for the estimation of word difficulty: the evidence seems to suggest that current methods of estimating item difficulty are sufficient. The strong evidence for unidimensionality, reliability, and validity, coupled with the practicality of the Y/N, makes it an ideal candidate for gathering data to provide statistical estimates of the difficulty of words as perceived by Japanese learners of English.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
