Abstract
Differential skill functioning (DSF) exists when examinees from different groups have different probabilities of successful performance in a certain subskill underlying the measured construct, given that they have the same ability on the overall construct. Using a DSF approach, this study examined the differences between two native language groups – a group with an East Asian language background and one with a Romance language background – in regard to reading subskills as represented in the Michigan English Language Assessment Battery (MELAB) reading test. Based on a combination of literature review and think-aloud reports from a sample of ESL students, hypotheses on reading subskill differences between the two groups were generated. These hypotheses were tested by first identifying the subskill profile of each examinee in a large MELAB database via the application of a previously determined item-skill Q-matrix to a Fusion Model of cognitive diagnostic modeling. The subskill profiles of the East Asian examinees were then compared against those of examinees with a Romance language background through logistic regression techniques. Some important DSFs were found between the two groups. Based on results of this study, instructional strategies were suggested to address some specific weaknesses in ESL learners’ reading subskills.
Introduction
Cross-linguistic researchers contend that the cognitive mechanism used in language processing differs across languages and thus is language-specific (Koda, 2005). It is accepted, therefore, that English as a second language (ESL) learners with different native language backgrounds behave in different ways when learning the same foreign language (Ringbom, 1987). It has also been observed that ESL learners from East Asian countries, especially China, Japan, and Korea, constitute a group that faces the greatest challenge in learning English (Lee, 2006). One commonality that the main languages of these three countries share is that they use scripts that are radically different from the Roman alphabet (Taylor & Taylor, 1995). This means that it is more difficult for East Asian ESL learners to read English than it is for, say, Spanish speakers. Furthermore, the grammar system of each of these three languages is very different from that of English, which is an Indo-European language. East Asian ESL learners’ English reading processes and skills, therefore, may differ from those used by individuals whose native languages are Indo-European.
Another distinct feature of these three East Asian countries (i.e. China, Japan, and Korea) is their English instruction and testing practice. The Chinese historical civil service exam system (Suen & Yu, 2006) and the ‘Confucian-heritage culture’ (Biggs, 1996, p. 46) continue to influence the schooling practice of some East Asian countries such as Korea, Japan, and Vietnam. Currently, English language tests are used as gate-keeping devices for access to general employment and higher education in those countries (Cheng, 2008; Ross, 2008). Traditionally, the teaching of English in East Asian countries has been dominated by a test-oriented, book-centered, grammar-translation method (Rao, 2001), which emphasizes rote memorization rather than communication and higher-level thinking skills. Those distinctive teaching and learning styles may influence East Asian ESL learners’ reading skills and strategies. For instance, according to Abbott (2006), compared to Arabic ESL learners, ESL learners from China had an advantage in terms of extracting explicitly stated information in reading due to their intensive training with bottom-up reading skills (e.g. the skills focused on word meaning, syntax, or text details), even though they were likely to find some higher-order reading skills (e.g. the skills focused on the gist of a text, background knowledge, or discourse organization) to be challenging. Therefore, the teaching and learning styles of East Asian countries may shape their ESL learners’ reading in different ways than they do for Indo-European ESL learners.
The purpose of this study is to examine native language group differences in the subskills of reading as represented in the Michigan English Language Assessment Battery (MELAB) reading test, in order to provide group-based information for second-language reading instruction for students from different language backgrounds. Developed by the English Language Institute at the University of Michigan (ELI-UM), the MELAB is used to evaluate the advanced-level English language competence of adult nonnative speakers who will use English for academic purposes in a university setting. Designed to assess examinees’ understanding of college-level reading texts, the MELAB reading section consists of four passages, each of which is followed by five multiple-choice items. One native language group examined in this study consisted of individuals whose native languages are Chinese, Korean, or Japanese. This group is referred to as the ‘East Asian’ group. The other group consisted of individuals whose native language is one of the Romance languages, referred to as the ‘Romance’ group. Simply comparing the English reading subskills of these two groups may not yield much information, because examinees whose first language is a Romance language usually perform better on English reading tests than examinees whose first language is an East Asian one. Therefore, the comparison was conducted via a differential skill functioning (DSF) approach by controlling for the overall English reading ability of both groups.
Overview of the DSF approach
The DSF analysis is an extension of traditional differential item functioning (DIF) methods. DIF occurs when examinees from different groups show different probabilities of success on an item after being matched on the underlying ability the test is intended to measure (Camilli & Shepard, 1994). The DIF method was originally developed to detect possible item bias across subgroups. It has been pointed out that multidimensionality may be the cause of DIF (e.g. De Ayala, Kim, Stapleton, & Dayton, 1999; Klieme & Baumert, 2001), and that DIF occurs when one of the unintended additional dimensions reflected by a score is related to race/ethnicity, gender, or other demographic variables. However, there is no particular reason for DIF analyses to focus on these demographic variables exclusively. For instance, Corter and Tatsuoka (2002) conducted a DIF analysis with TIMSS-R 1999 math items across different countries. They concluded that certain items demonstrated DIF not because they are biased but because they involve construct-relevant skills that are mastered differently in different countries. Therefore, it has been suggested that DIF analyses could be used to explain group strengths and weaknesses (Dogan, Guerrero, & Tatsuoka, 2005; Klieme & Baumert, 2001).
However, differences in group strengths and weaknesses rarely occur at the item level alone, and because the unit of analysis in DIF is the item, DIF is not very useful for studying these differences. As an alternative, DSF analyses have been proposed. The term DSF can be traced back to Milewski and Baron (2002), who extended the DIF procedure to individual performance on skills measured by the Preliminary SAT/National Merit Scholarship Qualifying Test (PSAT/NMSQT) in order to compare aggregate groups, such as schools or states, to the total population matched on overall scores. In the Milewski and Baron approach, a modified Rule Space Model (DiBello, 2002) was used to classify examinees into skill mastery patterns associated with different cognitive skills. Next, the cognitive skill profiles of different groups were compared when their overall ability was controlled for. Alternatively, Gierl, Zheng, and Cui (2008) described how to use the attribute hierarchy method (AHM; Leighton, Gierl, & Hunka, 2004) in order to evaluate differential group performance at the cognitive attribute (or skill) level. Gierl et al. (2008) named this approach attribute-level differential functioning (ADF) and explained that ‘ADF occurs when examinees, with the same matching attribute pattern but from different groups, have unequal probabilities of responding to items that measure the studied [skill]’ (p. 73). Although the logic and approach of DSF are similar to those in DIF analyses, the target, or the unit of analysis, is a skill reflected by a number of items, rather than a single item.
For the purpose of this study, the DSF analyses of MELAB subskill differences between East Asian examinees and Romance examinees were carried out in three steps. In the first step, specific hypotheses were generated regarding how the two groups would differ at the subskill level of reading based on ESL students’ think-aloud verbal reports. Second, each examinee’s profile on each reading subskill underlying the MELAB reading test was identified by applying the Fusion Model (Hartz, 2002) to the response data of the examinees in the two groups. Third, the hypotheses were tested by comparing the subskill profiles of the East Asian examinees against the subskill profiles of examinees with a Romance language background via DSF analyses through logistic regression techniques (Swaminathan & Rogers, 1990). Based on the results of the DSF analyses, some recommendations were provided for ESL reading instruction.
Calibration of MELAB reading items
Before generating hypotheses regarding subskill differences between the two groups and estimating subskill profiles of examinees in the two groups based on their performance on the MELAB, it was necessary to identify the subskills involved in the MELAB reading test in the form of a Q-matrix, which is a two-dimensional table of specifications identifying the subskills required for successfully answering each particular item (Tatsuoka, 1983). The MELAB Form E, which consists of 20 reading items, was used in this study. The Fusion Model, also known as the Reparameterized Unified Model (RUM), was applied to a MELAB dataset with 2019 examinees.
The Fusion Model is an IRT-like multidimensional model that expresses the stochastic relationship between item responses and underlying skills. The biggest advantage of the Fusion Model over other cognitive diagnostic models (CDMs; DiBello & Stout, 2007; Roussos, DiBello, Henson, Jang, & Templin, 2010; Rupp & Templin, 2008) is that it acknowledges the incompleteness of the Q-matrix and compensates for this by including the residual parameter c_i, which represents all the other skills that have been used by the examinees but have not been specified in the Q-matrix (Hartz, 2002; Roussos et al., 2007). As we do not have a full understanding of the cognitive processes underlying reading comprehension (Lee & Sawaki, 2009a), it is impossible to be certain that we have identified all the skills necessary to correctly answer an item. The inclusion of the residual parameter admits this practical limitation. Furthermore, the Arpeggio program (Bolt et al., 2008) helps to modify the Q-matrix by removing nonsignificant item parameters, thereby facilitating the process of building a valid Q-matrix. Given the complexity of reading comprehension, the Fusion Model has great potential for conducting cognitive diagnostic analyses with reading tests and has been used in a number of studies (e.g. Jang, 2009; Kim, 2011; Lee & Sawaki, 2009b).
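To make the role of the residual parameter concrete, the following sketch evaluates an item response function of the form in which the Reparameterized Unified Model is commonly written: a baseline probability for masters, multiplied by a penalty for each required but non-mastered skill, multiplied by a residual term governed by c_i. The function name, the parameter values, and the plain logistic form of the residual term (without a scaling constant) are illustrative assumptions, not the calibration settings used in this study.

```python
import math

def rum_probability(pi_star, r_star, q_row, alpha, theta, c):
    """Probability of a correct response under a RUM-style item response function.

    pi_star : probability of applying all required skills correctly, given mastery
    r_star  : per-skill penalties in (0, 1] for lacking a required skill
    q_row   : the item's Q-matrix row (1 = skill required, 0 = not required)
    alpha   : the examinee's mastery pattern (1 = master, 0 = non-master)
    theta   : residual ability covering skills not listed in the Q-matrix
    c       : residual (completeness) parameter of the item
    """
    penalty = 1.0
    for r, q, a in zip(r_star, q_row, alpha):
        # A penalty applies only when a required skill has not been mastered.
        penalty *= r ** ((1 - a) * q)
    # Assumption: plain logistic residual term with difficulty -c.
    residual = 1.0 / (1.0 + math.exp(-(theta + c)))
    return pi_star * penalty * residual

# Hypothetical item requiring subskills 1 and 3 out of four.
q_row = [1, 0, 1, 0]
r_star = [0.4, 1.0, 0.6, 1.0]
full_master = rum_probability(0.9, r_star, q_row, [1, 1, 1, 1], theta=0.0, c=2.0)
lacks_skill1 = rum_probability(0.9, r_star, q_row, [0, 1, 1, 1], theta=0.0, c=2.0)
```

In this sketch, a larger c pushes the residual term toward 1, so the response depends almost entirely on the skills listed in the Q-matrix; a small c leaves more room for unspecified skills to affect the outcome.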
Gao and Rogers (2010) developed a model of the cognitive processes used by examinees taking the MELAB reading test and validated the model with tree-based regression (TBR; Sheehan, 1997). This investigation provides useful information for studying the diagnostic potential of the MELAB reading test. With reference to Gao and Rogers’ cognitive model as well as the cognitive model proposed for TOEFL iBT reading by Jang (2009), we developed a tentative Q-matrix for the MELAB reading test. This tentative Q-matrix was further verified and modified based on data collected from multiple sources, including think-aloud protocols from 13 ESL students and ratings provided by four reading experts in the pilot study. The resulting Q-matrix was then validated via an application of the Fusion Model using a MELAB Form E dataset with 2019 examinees. Originally, five subskills were identified as underlying the 20 items. However, only two items required the subskill of making inferences, and these two items were judged to provide insufficient information for estimating that subskill. Alternatively, the subskill of making inferences could have been merged with the subskill of connecting and synthesizing to make a broader skill category. However, the former requires readers to speculate beyond the text using background and/or topical knowledge, whereas the latter involves connecting and synthesizing text-based information. Therefore, due to the idiosyncratic nature of these two skills, we decided not to merge them. Hence, the subskill of making inferences and the two affiliated items, item 5 and item 10, were dropped from subsequent analyses. Through the procedure described above, four main subskills were found to underlie the MELAB reading test, that is, vocabulary, syntax, extracting explicit information, and connecting and synthesizing. Table 1 provides details of the four subskills.
Table 1. Subskills of reading as represented in the MELAB Reading Test.
Furthermore, based on the preliminary Fusion Model calibration results, five Q-matrix entries were dropped because the corresponding skills were found not significant for solving the items (Hartz, 2002). The convergence of the Fusion Model calibration using this refined Q-matrix was deemed excellent. For example, the time-series chain plots and density plots of the parameters did not show noticeable trends or fluctuations (Sinharay, 2004). All the parameters also met the Heidelberger–Welch diagnostic and Geweke Z convergence criteria (Ntzoufras, 2009). Finally, the fit of the Fusion Model calibration using the refined Q-matrix was examined. There was negligible difference, if any, between the model-predicted values and the observed values in terms of item p-values and total scores. Also, there was a substantial difference between proportion-correct scores for item masters and item non-masters on an item-by-item basis. Taken together, the above evidence shows that the model fit the data reasonably well when the refined Q-matrix was used (see Li, 2011 for more details).
To summarize, multiple sources of evidence, including literature review, students’ think-aloud verbal reports, and expert ratings, were used to build the initial Q-matrix. For more reliable estimation and statistical parsimony, the initial Q-matrix was refined; that is, items 5 and 10 and the associated skill of making inferences, as well as the five nonsignificant Q-matrix entries, were dropped. Supported by substantive and empirical evidence, the refined Q-matrix was adopted as the final Q-matrix. This final Q-matrix, as shown in Table 2, involves four subskills and 18 items. The number ‘1’ indicates that the subskill is required by the item, whereas ‘0’ indicates that the subskill is not required by the item.
Table 2. Q-matrix used for the Fusion Model calibration.
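As an illustration of the data structure only, a Q-matrix of this kind can be held as a binary table with one row per item and one column per subskill. The item labels and the 0/1 entries below are hypothetical and do not reproduce Table 2; the helper names are likewise introduced here for illustration.

```python
# Hypothetical 4-item excerpt; these entries are illustrative, not Table 2.
SUBSKILLS = [
    "vocabulary",
    "syntax",
    "extracting explicit information",
    "connecting and synthesizing",
]

Q_MATRIX = {
    "item_a": [1, 0, 1, 0],
    "item_b": [0, 1, 1, 0],
    "item_c": [1, 0, 0, 1],
    "item_d": [0, 0, 1, 1],
}

def required_subskills(item):
    """Names of the subskills an item requires according to the Q-matrix."""
    return [name for name, flag in zip(SUBSKILLS, Q_MATRIX[item]) if flag == 1]

def items_per_subskill():
    """Number of items measuring each subskill (column sums of the Q-matrix)."""
    return {
        name: sum(row[k] for row in Q_MATRIX.values())
        for k, name in enumerate(SUBSKILLS)
    }
```

The column sums are worth checking when a Q-matrix is built, because a subskill measured by very few items (as was the case for making inferences, with two items) provides little information for estimating mastery.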
With the Q-matrix and the response data as the input for the Fusion Model calibration, the profile of each of the 2019 examinees across these four subskills was estimated; that is, a posterior probability of mastery (PPM) was produced for each subskill to indicate the probability that an examinee is a master of that subskill. This continuous PPM can also be converted into a categorical (dichotomous or polytomous) subskill mastery status.
DSF analyses of group subskill differences
Step 1: Hypothesis generation on subskill differences
After the subskills involved in the MELAB had been identified and validated, specific hypotheses regarding how the two language groups were expected to differ on these four subskills were generated with think-aloud protocols based on a grounded theory orientation (Glaser & Strauss, 1967; Strauss & Corbin, 1998). It is known that ESL learners from different native language groups show different patterns in their reading processes and skills (Koda, 2005). Also, the particular teaching and learning styles in East Asian countries may shape their ESL learners’ reading in different ways than do the particular teaching and learning styles experienced by ESL learners with a Romance language background. However, there is insufficient knowledge regarding exactly how these groups differ on the four particular subskills involved in the MELAB reading test. Therefore, think-aloud protocols from six ESL learners with an East Asian language background and four ESL learners with a Romance language background were analyzed with a grounded theory orientation. With the purpose of building theory from data when the theory is either unavailable or insufficient, the grounded theory approach comprises reading (and re-reading) a textual database (such as field notes and interview transcripts) and discovering or labeling variables (e.g. categories, concepts, and properties) and their interrelationships (Bryant & Charmaz, 2007).
Table 3 briefly summarizes the background of the 10 ESL participants. Their reading ability roughly fell into two categories, high or low, based on their English learning background, previous English test scores, and performance on the MELAB reading test.
Table 3. Participants’ background information.
We read through the think-aloud transcripts of the two groups in order to see whether patterns or differences in their English reading processes associated with their native languages would emerge. First, the transcripts were open-coded with the purpose of identifying incidents and understanding processes. The questions that guided this initial coding were ‘What is happening here? How did he/she get this item right or wrong? How are the reading processes related to his/her native language?’ We added brief comments in the wide margins of the transcripts. Second, we reviewed the commented portions of the transcripts again at a more general level in order to understand any patterns that each participant had shown in answering the questions. Third, constant comparisons (Glaser, 1978) were made within and across the native language groups.
Both concurrent and retrospective think-aloud protocols were adopted in this study (Ericsson & Simon, 1993). The following provides a snapshot of one piece of evidence indicated in the transcripts (see Li, 2011 for details of the qualitative evidence for all the four subskills). Eva, a native Spanish speaker with relatively low English reading ability, frequently referred to Spanish cognates or Latin roots, stems, suffixes, and prefixes for the purpose of recognizing English words. Sometimes, the English word was very similar to its Spanish equivalent. For instance, she said ‘I know granite because it’s very similar in my language. It’s a kind of rock.’ And, frequently, she was able to recognize a word based on her knowledge of morphology and Latin. For instance, she commented on the word unpalatable: ‘It’s from Latin. Un- means non-. This is the word for this part about mouth in my language. Palate is very similar in Spanish. It means eat. So I guess this word means not possible to eat. All of the technical words are always very similar, because they come from Latin.’ Ted, a Chinese male, was very troubled by some unknown technical words in passage 3, despite the fact that his overall English reading ability was higher than that of Eva. While reading passage 3, he frequently complained about the unknown words: ‘I hate this passage, too many new words, I wasn’t really sure.’ In contrast to Eva, who could get the meaning of granite from her native language, Ted relied on his memory: ‘Maybe it is kind of rock. I may have recited this word before while preparing for GRE. Anyway, I am not sure now.’ It seemed that he was drawing the meaning of the word from his memory instead of identifying the meaning during the reading process. The meaning of granite was critical to answering item 15, and thus Ted was very hesitant about his answer. 
Compared to Eva, who drew extensively from her native language, Ted was at a disadvantage with word recognition, though his English reading ability was in fact higher than that of Eva.
As the study proceeded, we wrote memos and journals in order to capture, define, and summarize the differences between and the commonalities among the participants. External audits and peer audits were consistently conducted during the grounded theory study to safeguard the data analysis and interpretation. We continued the data analysis until we were confident about the hypotheses that had emerged from the data. Overall, it could be observed and thus hypothesized that, given the same English reading ability, in comparison with their Romance counterparts, East Asian ESL learners performed less well in linguistic subskills, such as vocabulary and syntax, but better in comprehension subskills, such as extracting explicit information and connecting and synthesizing.
Finally, the generated hypotheses were validated against the reading literature. As recommended by Glaser (1998), the researcher must have an open attitude to the research question, so that the generation of theory is not compromised by a researcher’s preexisting views but directly emerges from the data. Therefore, Glaser insists that the researcher should not review the literature until the emerging theory has developed sufficiently based on the data. An extensive post-literature-review analysis was thus conducted to validate the generated hypotheses, and the important question of ‘Do the hypotheses make sense?’ was asked.
Cross-linguistic studies have shown that East Asian ESL learners may have disadvantages with linguistic subskills owing to the fact that their native languages are very different from English (Bates, Devescovi, & D’Amico, 1999; Juffs, 1998; Koda, 2005). This agrees with our observation through the think-aloud verbal reports that the East Asian students seemed to have more difficulty with vocabulary and syntax than the Romance-language students. In addition, as stated by Stanovich (1980, p. 47), ‘knowledge sources at all levels contribute simultaneously to pattern synthesis and a lower-level deficit may result in a greater contribution from higher-level knowledge sources.’ This is in alignment with the phenomena observed in the think-aloud protocols that East Asian ESL learners seemed to rely more on their comprehension subskills to offset their relatively deficient linguistic subskills in order to achieve the same overall reading performance as their Romance-language counterparts. The generated overall hypothesis appeared sensible, workable, and trustworthy.
We found that high-level participants usually showed less difficulty in reading and thus produced fewer verbal reports. Although it was more difficult to examine subskill differences for high-level participants due to the smaller amount of data, the think-aloud verbal reports showed that the observed native language group differences were generally consistent for both low and high level participants. Neither the reading literature nor results of the think-aloud protocols provided any discernible rationale for expecting the existence of non-uniform DSF (cf. Mellenbergh, 1982) in the four subskills studied. Therefore, only uniform DSFs were expected. The overall hypothesis can be broken down into the following four specific hypotheses:
Hypothesis 1: There is DSF for the subskill of vocabulary favoring the Romance group.
Hypothesis 2: There is DSF for the subskill of syntax favoring the Romance group.
Hypothesis 3: There is DSF for the subskill of extracting explicit information favoring the East Asian group.
Hypothesis 4: There is DSF for the subskill of connecting and synthesizing favoring the East Asian group.
Additionally, the literature has shown that female students are generally better readers than male students (e.g. Logan & Johnston, 2009), though how gender differences relate to the specific subskills of reading is not clear. Thus, in order to be certain that the observed DSF was not attributable to gender, gender was also taken into account as an additional controlling variable in all subsequent DSF analyses.
Step 2: Extracting examinee subskill mastery profile
Of the 2019 examinees in the MELAB dataset used for the earlier Fusion Model calibration, 522 examinees had an East Asian language background (i.e. 410 Chinese, 84 Korean, and 28 Japanese), whereas 147 examinees had a Romance language background (i.e. 75 Spanish, 37 Romanian, 21 Portuguese, 12 French, and 2 Italian). The subskill profile of each of these 669 examinees (i.e. the PPM of each subskill for each examinee) was drawn from the results of the Fusion Model calibration with the large MELAB dataset. As shown in Figure 1, most of the examinees had either a very high or a very low PPM, with few examinees in the middle range. This indicates that the examinees could easily be classified as masters or non-masters of the skills (Lee & Sawaki, 2009b).

Figure 1. Distribution of the PPM.
Table 4 presents the descriptive statistics of the PPM for the East Asian group and the Romance group. The average PPM for the Romance group was generally higher than that for the East Asian group across all four subskills. In addition, we can determine examinee mastery status by evaluating the PPM of an examinee against a predetermined cut-off PPM value. In accordance with Hartz (2002) and Roussos et al. (2007), a cut-off PPM criterion of .5 was used to reach a dichotomous mastery status for each examinee on each subskill, that is, non-master if PPM < .5, and master if PPM > .5. The final column of Table 4 shows the percentage of masters of each subskill across the two groups. A more refined polytomous classification scheme can also be used, with .4 and .6 as cut-off points (e.g. Jang, 2009). However, in the present study, less than 7% of examinees had PPMs between .4 and .6 for all four subskills; therefore, the more refined polytomous status would not have changed the classification results much. In addition, the polyserial correlation between the number of subskills that an examinee had mastered and the examinee’s total test score was .957. This indicates that examinees who had mastered more subskills also had higher test scores.
Table 4. Descriptive statistics of the PPM.
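The two classification schemes described above can be sketched as follows. The function names and the example PPM values are hypothetical; the treatment of a PPM exactly equal to .5 is also an assumption, since the text specifies only PPM < .5 and PPM > .5.

```python
def dichotomous_status(ppm, cutoff=0.5):
    """Master / non-master classification from a posterior probability of mastery.

    Assumption: a PPM exactly equal to the cut-off is treated as mastery.
    """
    return "master" if ppm >= cutoff else "non-master"

def polytomous_status(ppm, lower=0.4, upper=0.6):
    """Three-way classification with an indeterminate band between the cut-offs."""
    if ppm < lower:
        return "non-master"
    if ppm > upper:
        return "master"
    return "undetermined"

def share_in_band(ppms, lower=0.4, upper=0.6):
    """Proportion of PPMs that fall inside the indeterminate band."""
    inside = sum(1 for p in ppms if lower <= p <= upper)
    return inside / len(ppms)

# Hypothetical PPMs for one subskill (not taken from the MELAB data).
example_ppms = [0.02, 0.10, 0.45, 0.55, 0.93, 0.97, 0.99, 0.05]
band_share = share_in_band(example_ppms)
```

When the share of PPMs inside the band is small, as reported for the present data, the dichotomous and polytomous schemes classify nearly all examinees identically.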
Step 3: DSF analyses via logistic regression
The traditional logistic regression method for DIF analyses generally uses the total test score as the matching variable. As cautioned by Zhang (2006), using the total score as the matching variable may not be feasible when the test is calibrated by a multidimensional cognitive diagnostic model. She proposed matching examinees on their skill profile patterns instead. However, when many subskills are involved in a test, matching on profile patterns may not be practical. Take the MELAB reading test for example: with four subskills involved in the test, examinees could have as many as 16 (i.e. 2^4) skill profile patterns. In addition, some skill profile patterns may have far fewer examinees than others due to the different difficulty levels of the subskills (Lee & Sawaki, 2009b). Given that the sample size of the current DSF study was only 669, it is not practical to match examinees on 16 skill profile patterns. Therefore, the total score was used as the matching variable in this study. Figure 2 shows the frequency distribution of the two groups’ total scores. The mean and standard deviation for the East Asian group (N = 522) were 10.00 and 4.08, whereas the mean and standard deviation for the Romance group (N = 147) were 11.25 and 3.883.
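The sparseness problem with profile matching can be seen by enumerating the patterns. The short sketch below lists all 16 mastery patterns for four subskills and counts how many examinees fall in each; the six example profiles are hypothetical.

```python
from collections import Counter
from itertools import product

SUBSKILLS = ["vocabulary", "syntax", "explicit information", "connecting/synthesizing"]

# Every possible master (1) / non-master (0) pattern over four subskills.
all_patterns = list(product([0, 1], repeat=len(SUBSKILLS)))

def pattern_counts(profiles):
    """Count examinees per mastery pattern, including patterns nobody shows."""
    observed = Counter(tuple(p) for p in profiles)
    return {pattern: observed.get(pattern, 0) for pattern in all_patterns}

# Hypothetical profiles for six examinees.
profiles = [
    (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0),
    (0, 0, 1, 0), (1, 1, 1, 1), (0, 0, 0, 0),
]
counts = pattern_counts(profiles)
empty_cells = sum(1 for n in counts.values() if n == 0)
```

Patterns with few or no examinees in one of the groups cannot support a within-pattern group comparison, which is the practical concern raised above.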

Figure 2. Distribution of total scores.
Another consideration is whether to purify the total score. In traditional DIF, this is done by removing the item identified as exhibiting DIF from the conditioning total score (Clauser, Mazor, & Hambleton, 1993). Gierl, Zheng, and Cui (2008) extended the purification process to DSF by using only those subskills independent of the studied subskill as matching subskills. Though theoretically appealing, with this approach the matching variable itself may change dramatically as the number of subskills is usually small compared to the number of items. For instance, when only four subskills are involved in a test, as is the case with the MELAB, to compare different groups’ performance on subskill 1, examinees are matched on subskills 2, 3, and 4. Then, for subskill 2, they are matched on subskills 1, 3, and 4. There is an obvious and substantial change of the matching variable from one analysis to another, which makes interpretation difficult. The DSF analysis in this study is designed to test the reading subskill differences of native language groups under the condition that they have the same overall English reading ability; on this basis, it is important to have a stable proxy for overall English reading ability. Therefore, the matching variable was not purified in this study.
Finally, because DSF analysis is relatively new, there is not yet a generally agreed-upon sample size requirement for DSF studies. However, given the technical equivalence between DSF and DIF, the sample size requirement for DIF was used as a reference in the present study. Regarding DIF studies using logistic regression, a sample size of 200 per group is generally suggested to ensure adequate power and to avoid inflated Type I error (Mazor, Clauser, & Hambleton, 1992; Zumbo, 1999), though a smaller sample size, such as 100, has been regarded as adequate (Lai, Teresi, & Gershon, 2005). The sample sizes in the present study, 147 for the Romance group and 522 for the East Asian group, are not particularly large but were considered adequate.
In order to be certain that the observed DSF was not attributable to gender, a two-stage DSF procedure was conducted in this study. In the first stage, only the total score was entered as the internal matching variable. In the second stage, gender was entered as an additional external matching variable.
Total score as the matching variable
The total score was used as the internal matching variable in order to determine whether the two native language groups had different performance on the subskills of reading given the same overall English reading ability. As shown in the following equations with subskill i (i from 1 to 4) as the example, only the total score was entered as a predictor in Model 1. Then the language group variable was added as an additional predictor in Model 2. If the −2 log-likelihood difference between Model 1 and Model 2 is larger than the critical χ² value with 1 degree of freedom, DSF exists. A statistical significance level of .05 was used in the study for DSF judgment.
Model 1: Mastery_i = Total score
Model 2: Mastery_i = Total score + Language
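Written out as logistic regression models in the style of Swaminathan and Rogers (1990), and assuming that the dichotomous mastery status is the outcome, the two models above take roughly the following form. The symbols (M_i for mastery of subskill i, T for the total score, L for the language-group indicator, the β coefficients, and G² for the likelihood-ratio statistic) are notation introduced here for illustration.

```latex
\begin{align}
\text{Model 1:}\quad \operatorname{logit} P(M_i = 1) &= \beta_0 + \beta_1 T \\
\text{Model 2:}\quad \operatorname{logit} P(M_i = 1) &= \beta_0 + \beta_1 T + \beta_2 L \\
G^2 &= \left(-2 \ln \mathcal{L}_1\right) - \left(-2 \ln \mathcal{L}_2\right)
\end{align}
```

Under the null hypothesis of no uniform DSF (β_2 = 0), G² is asymptotically distributed as χ² with 1 degree of freedom, which is why the difference is compared against 3.84 at the .05 level.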
Total score and gender as matching variables
In order to be certain that the observed DSF was not attributable to gender, gender was entered as an external conditioning variable in addition to the internal conditioning variable of the total score. As shown in the following equations with subskill i (i from 1 to 4) as the example, the total score and gender were entered as predictors in Model 1. Then the language group variable was added as an additional predictor in Model 2. If the −2 log-likelihood difference between Model 1 and Model 2 is larger than the critical χ² value with 1 degree of freedom, DSF exists.
Model 1: Mastery_i = Total score + Gender
Model 2: Mastery_i = Total score + Gender + Language
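The two-stage procedure described above can be sketched with a small, dependency-free logistic regression fitted by Newton-Raphson. This is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the toy data are generated deterministically rather than drawn from the MELAB database, and the outcome is assumed to be dichotomous mastery. For Stage 1 each matching row holds only the total score; for Stage 2 a gender code would be appended to each row.

```python
import math

def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(rows, y, n_iter=30):
    """Newton-Raphson logistic regression; returns (coefficients, -2 log-likelihood)."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept
    p = len(design[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(design, y):
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for j in range(p):
                grad[j] += (yi - mu) * x[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * x[j] * x[k]
        beta = [b + s for b, s in zip(beta, _solve(hess, grad))]
    neg2ll = 0.0
    for x, yi in zip(design, y):
        eta = sum(b * v for b, v in zip(beta, x))
        # Log-likelihood contribution: y * eta - log(1 + exp(eta)).
        neg2ll -= 2.0 * (yi * eta - math.log1p(math.exp(eta)))
    return beta, neg2ll

def dsf_lr_test(matching_rows, language, mastery, critical=3.84):
    """Compare Model 1 (matching variables) with Model 2 (plus language group)."""
    _, neg2ll_1 = fit_logistic(matching_rows, mastery)
    rows_2 = [list(r) + [g] for r, g in zip(matching_rows, language)]
    beta_2, neg2ll_2 = fit_logistic(rows_2, mastery)
    lr = neg2ll_1 - neg2ll_2
    return {"lr": lr, "odds_ratio": math.exp(beta_2[-1]), "dsf": lr > critical}

def make_data(group_effect, n_per_cell=20):
    """Deterministic toy data: mastery counts follow a logistic curve in each cell."""
    rows, language, mastery = [], [], []
    for score in range(19):  # total scores 0..18
        for group in (0, 1):
            p = 1.0 / (1.0 + math.exp(-(0.5 * (score - 9) + group_effect * group)))
            masters = round(n_per_cell * p)
            for i in range(n_per_cell):
                rows.append([score])
                language.append(group)
                mastery.append(1 if i < masters else 0)
    return rows, language, mastery

# Toy data with a built-in group effect (hypothetical, for illustration only).
rows, language, mastery = make_data(group_effect=1.1)
result = dsf_lr_test(rows, language, mastery)
```

The returned dictionary holds the −2 log-likelihood difference, the odds ratio for the language-group coefficient in Model 2, and the DSF decision at the 3.84 critical value.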
Results of the DSF analyses
Existence of the DSF
Total score as the matching variable
Table 5 shows, in its last column, the −2 log-likelihood difference between Models 1 and 2 for each of the four subskills examined. A difference larger than the critical value of chi-square with 1 degree of freedom (i.e. χ² (1, .05) = 3.84) indicates evidence of DSF. For the subskill of vocabulary, the −2 log-likelihood difference was found to be 7.742, which is larger than the critical value of 3.84. For syntax, extracting explicit information, and connecting and synthesizing, however, the −2 log-likelihood differences were all smaller than 3.84. That is, DSF existed only in the subskill of vocabulary.
Table 5. Summary of −2 log-likelihood differences of Stage 1 analysis.
Note: * Larger than the critical value of χ2 (1, .05) = 3.84.
Total score and gender as matching variables
As shown in the last column of Table 6, for the subskill of vocabulary, the −2 log-likelihood difference was 7.751, which is larger than 3.84. This indicated that DSF still existed for the subskill of vocabulary when gender was controlled for. For the subskills of syntax and extracting explicit information, the −2 log-likelihood differences were smaller than 3.84 in both cases. However, for the subskill of connecting and synthesizing, the −2 log-likelihood difference was now 4.202, larger than 3.84. Therefore, DSF existed for the subskill of connecting and synthesizing when gender, in addition to the total score, was controlled for.
Table 6. Summary of −2 log-likelihood differences of Stage 2 analysis.
Note: * Larger than the critical value of χ2 (1, .05) = 3.84.
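The comparison against the critical value can be checked with a short calculation. The sketch below converts the three −2 log-likelihood differences reported in the text into upper-tail χ2 probabilities. It relies on the standard identity that, for 1 degree of freedom, the upper-tail probability equals erfc(√(x/2)). The function name `chi2_sf_1df` and the dictionary labels are illustrative choices, not part of the original analysis.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

CRITICAL = 3.84  # chi-square critical value, 1 df, alpha = .05

# -2 log-likelihood differences reported in the text above.
reported = {
    "vocabulary (total score)": 7.742,
    "vocabulary (total score + gender)": 7.751,
    "connecting and synthesizing (total score + gender)": 4.202,
}

p_values = {name: chi2_sf_1df(diff) for name, diff in reported.items()}
for name, diff in reported.items():
    flag = "DSF" if diff > CRITICAL else "no DSF"
    print(f"{name}: difference = {diff:.3f}, p = {p_values[name]:.4f} -> {flag}")
```

All three reported differences exceed 3.84, so their tail probabilities fall below .05, in line with the DSF judgments stated in the text.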
Detailed findings and interpretation
There were statistically significant differences regarding the subskill of vocabulary and the subskill of connecting and synthesizing, with substantial practical significance for both. No DSF effect was found between the two language groups on the subskill of syntax, although a significant gender effect was found. For the subskill of extracting explicit information, there was insufficient evidence to support the hypothesis of a difference between the East Asian group and the Romance group. Details of the findings are provided below.
Vocabulary
As shown in Table 7, when only the total score was used as the matching variable, language group was a statistically significant predictor of the subskill of vocabulary, with a p-value of .007 and an odds ratio of 2.971 (i.e. Exp (β) = 2.971). When gender was controlled for in addition to the total score, Table 8 shows that language group was still a statistically significant predictor, with a p-value of .006 and an odds ratio of 2.986 (i.e. Exp (β) = 2.986); gender itself, however, was not a statistically significant predictor. Overall, given the same overall English reading ability, the odds that the Romance group would have mastery of the subskill of vocabulary were about three times as large as the odds for the East Asian group, regardless of gender. Hypothesis 1, which postulated that there is DSF for vocabulary favoring the Romance group, was supported.
Table 7. Regression coefficients for vocabulary when matched on total scores.
Note: * p < .05.
Table 8. Regression coefficients for vocabulary when matched on total scores and gender.
Note: * p < .05.
This result is consistent with the literature on language transfer: linguistic skills can be transferred from a person’s first language (L1) to that person’s second language (L2), and the closer an ESL learner’s L1 and L2 are, the easier the transfer (Fries, 1945). Due to the influence of Latin, English and Romance languages share many linguistic features. One such commonality is the use of the Roman alphabet, in which letters represent phonemes. Additionally, a large portion of present-day English words are based on or derived from Latin words. In contrast, Chinese characters, Korean Hanja, and Japanese Kanji belong to logographic systems, in which symbols map onto morphemes. Although Korean Hangul is alphabetic, it does not use the Roman alphabet and requires assembling individual symbols into syllable blocks (Taylor & Taylor, 1995). Additionally, because Korean has a large number of homonyms, the meaning of a specific Hangul word often has to be resolved by referring back to Hanja. Thus, because words are recognized differently in East Asian languages than in English, East Asian ESL learners are likely to find recognizing English words more difficult than Romance ESL learners do.
Syntax
Language group was not a significant factor whether or not gender was controlled for. However, as shown in Table 9, gender was a statistically significant predictor, with a p-value of .007 and an odds ratio of 2.199 (i.e. Exp (β) = 2.199). To examine the effect of gender further, the language group variable was removed from the logistic regression, as shown in Table 10. The results showed that gender remained a statistically significant predictor, with a p-value of .009 and an odds ratio of 2.153 (i.e. Exp (β) = 2.153). To summarize, given the same overall English reading ability, the odds that female ESL learners would have mastery of the subskill of syntax were about twice as large as the odds for male ESL learners, regardless of native language. However, Hypothesis 2, which postulated that for the subskill of syntax there is DSF favoring the Romance group, was not supported.
Table 9. Regression coefficients for syntax when matched on total scores and gender.
Note: * p < .05.
Table 10. Regression coefficients for syntax when language group was removed.
Note: * p < .05.
It should be pointed out that the syntax difference between the two groups’ native languages is not as clear-cut as the vocabulary difference. For instance, both Chinese and English primarily have Subject–Verb–Object as the word order, though Chinese relies less on word order than English does. If only Korean and Japanese ESL learners, whose native languages have Subject–Object–Verb as the word order, had been compared to Romance ESL learners in the DSF analysis, a clearer difference regarding syntax might have been detected. Another possible reason for the absence of DSF in syntax is the intensive training in grammar that East Asian learners receive during their English learning and test preparation. The focused training may have more than compensated for East Asian learners’ initial disadvantages in mastering English syntax.
Extracting explicit information
It was hypothesized that East Asian learners would be more skilled at extracting explicit information at the local level than Romance learners. However, as shown in Table 11, language group was not a statistically significant predictor. The negative β coefficient (i.e. −.147) for language group indicated a potential trend: given the same overall English reading ability, the East Asian group was more likely to have mastery of extracting explicit information than the Romance group. However, there is insufficient evidence to conclude that the language groups differ significantly. When gender was further controlled for, neither language group nor gender was statistically significant. To summarize, Hypothesis 3, which postulated that for the subskill of extracting explicit information there is DSF favoring the East Asian group, was not supported.
Table 11. Regression coefficients for extracting explicit information when matched on total scores.
Note: * p < .05.
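The link between a logistic regression coefficient and its odds ratio can be made concrete with a small calculation. The sketch below converts the reported β of −.147 into an odds ratio. It assumes, as the direction of the reported effects implies, that the language coefficient compares the Romance group with the East Asian group. The 50% baseline mastery probability is a hypothetical value chosen only to show what such an odds ratio means on the probability scale.

```python
import math

def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient: Exp(beta)."""
    return math.exp(beta)

def prob_from_odds(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)

beta_language = -0.147                    # reported coefficient for language group
or_romance = odds_ratio(beta_language)    # odds for Romance relative to East Asian
or_east_asian = 1.0 / or_romance          # the same comparison, reversed

# Hypothetical illustration: an East Asian examinee with a 50% chance of mastery
# (odds of 1.0) and a Romance examinee matched on total score.
baseline_odds = 1.0
p_romance = prob_from_odds(baseline_odds * or_romance)

# The reverse mapping: the vocabulary odds ratio of 2.971 as a coefficient.
beta_vocabulary = math.log(2.971)

print(f"Exp(-.147) = {or_romance:.3f}; reversed = {or_east_asian:.3f}")
print(f"Matched Romance mastery probability at a 50% baseline: {p_romance:.3f}")
print(f"ln(2.971) = {beta_vocabulary:.3f}")
```

An odds ratio of about .86 is close to 1, which fits the text's reading of the coefficient as a trend rather than a reliable group difference.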
Connecting and synthesizing
As shown in Table 12, when only the total score was controlled for, language group was not a statistically significant predictor. However, when gender was also controlled for, as shown in Table 13, language group became a statistically significant predictor, with a p-value of .043 and an odds ratio of .507 (i.e. Exp (β) = .507). This indicates that, given the same overall English reading ability and gender, the odds that the Romance group would have mastery of the subskill of connecting and synthesizing were only about half the odds for the East Asian group.
Table 12. Regression coefficients for connecting and synthesizing when matched on total scores.
Note: * p < .05.
Table 13. Regression coefficients for connecting and synthesizing when matched on total scores and gender.
Note: * p < .05.
In addition, as shown in Table 13, gender itself was also a statistically significant predictor, with a p-value of .024. To examine the effect of gender further, language group was removed from the logistic regression. As shown in Table 14, gender remained a significant predictor, with a p-value of .031, when language group was removed from the analysis. Specifically, given the same overall English reading ability, the odds that female ESL learners would have mastery of the subskill of connecting and synthesizing were about 1.9 times (i.e. Exp (β) = 1.875) as large as the odds for male ESL learners, regardless of native language group.
Table 14. Regression coefficients for connecting and synthesizing when language group was removed.
Note: * p < .05.
The subskill of connecting and synthesizing is a very broad category. Had a more finely grained category of subskill been defined, such as ‘identifying the main idea,’ the nature of these group differences might have been clearer. Connecting and synthesizing has been regarded as a relatively higher-level reading skill. However, as observed by Alderson (1990), readers with poor performance on items requiring lower-level skills did not necessarily fail to answer items requiring higher-level skills. In a response to Alderson’s study, Matthews (1990) argued that items requiring lower-level skills would probably be more difficult than items requiring higher-level skills, because the latter usually relate to a long passage of text and thus may be easier for poor readers to understand. Jang (2009) also found that the skill of summarizing had the largest number of masters among the nine skills involved in the diagnostic analysis of TOEFL iBT reading. In the present study, the think-aloud verbal reports indicated that the skill of connecting and synthesizing was less challenging for East Asian ESL learners, probably because they could resort to a large section of the text for information, even though they were challenged by the skill of vocabulary.
To summarize, Hypothesis 4, which postulated that for the subskill of connecting and synthesizing there is DSF favoring the East Asian group, was supported when gender was controlled for in the DSF analysis. In addition to language group differences, gender differences also emerged. Overall, it seems that female ESL learners with an East Asian language background were more skilled than other learners at connecting and synthesizing information.
Implications for instructional strategies
The purpose of the DSF is to compare group performance on certain subskills within the framework of cognitive diagnostic assessment in order to understand the relative weaknesses and strengths of examinees from different groups. The following sections give examples of instructional strategies for addressing the specific weaknesses in ESL learners’ reading subskills observed in this study. It is important to note that a variety of instructional strategies are available, and the selection should be based on learner characteristics and learning environments.
Vocabulary
Studies have shown that lack of vocabulary skills is the principal obstacle in reading comprehension (e.g. Adams, 1990, 1999; Juel, 1988; Perfetti, 2007). East Asian ESL learners are especially challenged by English vocabulary due to the vast difference between the writing systems of their native languages and that of English. It has been found that many East Asian students rarely acquire words incidentally (Cho, 2004; Hui, 2004), that is, ‘through exposure when one’s attention is focused on the use of language, rather than the learning itself’ (Schmitt, 2000, p. 116). As a result of the lack of such experience, their vocabulary knowledge tends to be limited; that is, they tend not to fully understand the usage or connotations of a word (Gui, 2004). In light of the findings regarding vocabulary in this study and the literature on vocabulary acquisition, extensive reading and increasing phonological awareness are especially recommended to help improve East Asian ESL learners’ vocabulary skill.
ESL reading instruction in East Asian countries tends to focus on intensive reading (Powell, 2005). With intensive reading, readers take a text, study it line by line, and refer frequently to a dictionary in order to understand the grammar and vocabulary of the text (Hafiz & Tudor, 1989; Palmer, 1917). Extensive reading, on the other hand, is different in that students read a large amount of longer, easily understood materials relatively fast, mostly out of the classroom and according to their own pace and schedule. Extensive reading is beneficial for ESL students’ reading proficiency, especially in vocabulary learning (Horst, 2005; Stanovich, 1986). It also has the potential to train ESL students to become proficient at acquiring vocabulary on their own (Krashen, 1981). Considerably greater exposure to authentic reading in English would help students in ‘overcoming the many L1–L2 differences that exist for L2 reading development’ (Grabe, 2009, p. 150). This would be especially helpful for East Asian ESL learners.
One reason for East Asian ESL students’ difficulty with English-word recognition is their lack of phonological awareness. East Asian ESL learners have been found to be less sensitive to phonological information in English-word recognition than learners with a Roman alphabetic L1 background (e.g. Brown & Haynes, 1985; Koda, 1990). As indicated by Baddeley (2006), storage, rehearsal, and reinforced memory of new words in phonological form in the working memory form the foundation of all vocabulary learning. Children’s phonological awareness is regarded as an important and reliable predictor of their later L1 reading ability (Ehri et al., 2001) as well as L2 reading ability (Bernhardt, 2011; Grabe, 2009). Different strategies are available for increasing East Asian ESL students’ phonological awareness in order to help them achieve more effective word recognition. One option is explicit classroom instruction (Archer & Hughes, 2011). ESL instructors may use tasks such as phonological oddity, deletion and substitution, and segmentation activities to help students improve their ability in this regard (Anthony, Lonigan, Driscoll, Phillips, & Burgess, 2003). In addition, oral reading, or reading aloud after class, helps students build phonological awareness on their own.
Connecting and synthesizing
The DSF analysis shows that females with an East Asian language background are more likely to perform well on the subskill of connecting and synthesizing than learners in other groups. Explicit instruction of text structure is one of the instructional strategies effective for developing the skill of connecting and synthesizing. Text structure strategy focuses on helping readers understand how the information in a text is organized (Taylor, 1992). For example, five basic types of expository rhetorical organization have been identified: comparison, problem-and-solution, cause-and-effect, sequence, and description (Meyer, 1985). A clear understanding of the text structure may lead to improvements in connecting and synthesizing information as well as in overall reading comprehension.
Many approaches are available for teaching students text structure strategy. For example, Meyer and her colleagues have taught students to use signaling words to help them recognize the different structures of expository texts (e.g. Meyer, 1985; Meyer & Poon, 2001; Meyer et al., 2010). Other researchers have proposed teaching students text structure strategy by using a graphic organizer (e.g. Berkowitz, 1986) and writing hierarchical outlines (e.g. Taylor & Beach, 1984). Moreover, students can be taught to use headings, subheadings, and topic sentences in order to understand the structure of a text (Seidenberg, 1989). Most of the above-cited studies investigated L1 reading. However, given the similarities between L1 and L2 reading and findings on the positive effects of using text structure strategy in L2 reading (Carrell, 1985), explicit instruction of text structure is a promising strategy for strengthening ESL learners’ ability to connect and synthesize information and for developing their overall reading comprehension. In light of the findings of this study, the explicit instruction of text structure may be most effective for male students with a Romance language background.
Summary and limitations
This study has demonstrated a procedure for detecting group differences at the subskill level using a DSF approach within the cognitive diagnostic framework. This approach allows a more insightful and detailed analysis of group differences. The resulting information can provide important guidance for classroom instruction and learning. Also, although it uses similar statistical procedures, DSF is different from DIF in that it aims to study language group differences in order to facilitate instruction and learning. Therefore, the existence of DSF does not indicate bias in any sense but reflects only the cognitive differences among groups underlying the overall construct being tested.
In order to extract examinee skill profiles, the Fusion Model was retrofitted to the MELAB, which was originally designed as an English proficiency test. A noticeable indeterminacy in the cognitive diagnostic analysis is the grain size of the subskills (Jang, 2009; Lee & Sawaki, 2009a; Sawaki, Kim, & Gentile, 2009). The more skills that are identified, the richer the diagnostic information that can be provided; however, including a large number of skills places stress on the capacity of statistical modeling, given the fixed length of a test. Gao and Rogers (2010) suggested over 10 reading skill components underlying the MELAB reading test. However, the MELAB reading test consists of only 20 items. Due to the complexity of the Q-matrix construction and concerns about the limited capacity of statistical modeling, only skills that are of substantial importance in correctly answering the items were considered for the diagnostic analysis in this study. The resulting number of skills involved in the cognitive diagnostic analysis was thus small. To overcome this limitation, the best approach is to design a reading test with a clear cognitive structure and relatively more items in order to obtain more refined estimates for the subskills of interest (Gierl & Cui, 2008).
The purpose of the DSF is to compare group performance on certain subskills within the framework of cognitive diagnostic assessment in order to understand the relative weaknesses and strengths of examinees from different groups. In this study, a rather broad native language grouping was used for comparison. Different groupings, however, could have been used. For instance, if more Korean and Japanese examinees had been available in the dataset, it would have been possible to explore potential subskill differences within the East Asian group. It is, therefore, important to conduct further studies using different native language groups with different samples to replicate and expand the present study. Furthermore, in addition to native language, other individual differences among examinees, such as gender, age, background knowledge, interest, motivation, and engagement, may well have influenced the reading process (Bernhardt, 2011). Among these, only gender was involved in the present DSF study, because research has shown salient gender differences in reading (Klinger, Shulha, & Wade-Woolley, 2009) and because the MELAB dataset provided a complete record of examinee gender. The DSF is a powerful approach for examining and describing subskill differences among subgroups. However, if the purpose is to establish causality, the DSF, like any other statistical modeling, cannot replace a randomized experimental design.
Acknowledgements
The authors are indebted to Pui-Wa Lei, Bonnie Meyer, Aleksandra Slavkovic, Yong-Won Lee, and Dorothy Evensen for their valuable suggestions and advice throughout this study.
Funding
This study was partially supported by a Spaan Fellowship in Second or Foreign Language Assessment from the English Language Institute at the University of Michigan and by the Small Grants for Doctoral Research in Second or Foreign Language Assessment provided by the TOEFL program at the Educational Testing Service.
