Abstract
Investigating the comparability of students’ performance on TOEFL writing tasks and actual academic writing tasks is essential to provide backing for the extrapolation inference in the TOEFL validity argument (Chapelle, Enright, & Jamieson, 2008). This study compared 103 international non-native-English-speaking undergraduate students’ performance on two TOEFL iBT® writing tasks with their performance in required writing courses in US universities as measured by instructors’ ratings of student proficiency, instructor-assigned grades on two course assignments, and five dimensions of writing quality of the first and final drafts of those course assignments: grammatical, cohesive, rhetorical, sociopragmatic, and content control. Also, the quality of the writing on the TOEFL writing tasks was compared with the first and final drafts of responses to written course assignments using a common analytic rubric along the five dimensions. Correlations of scores from TOEFL tasks (Independent, Integrated, and the total Writing section) with instructor ratings of students’ overall English proficiency and writing proficiency were moderate and significant. However, only scores on the Integrated task and the Writing section were correlated with instructor-assigned grades on course assignments. Correlations between scores on TOEFL tasks and all dimensions of writing quality were positive and significant, though of lower magnitude for final drafts than for first drafts. The TOEFL scores were most highly correlated with cohesive and grammatical control and had the lowest correlations with rhetorical organization. The quality of the writing on the TOEFL tasks was comparable to that of the first drafts of course assignments but not to that of the final drafts.
These findings provide backing for the extrapolation inference, suggesting that the construct of academic writing proficiency as assessed by TOEFL “accounts for the quality of linguistic performance in English-medium institutions of higher education” (Chapelle, Enright, & Jamieson, 2008, p. 21).
The Test of English as a Foreign Language (TOEFL®) measures, according to the Educational Testing Service (ETS), “the ability of nonnative speakers of English to use and understand English as it is spoken, written, and heard in college and university settings” (ETS, 2007, para. 1). First introduced in 2005, the TOEFL Internet-based Test (TOEFL iBT®) is intended to measure four language skills (listening, reading, speaking, and writing) essential for effective communication in academic settings and includes test tasks in the speaking and writing sections that require test takers to integrate skills. For example, the TOEFL iBT Writing section requires students to complete a writing task based on what they have heard and read in addition to a traditional Independent writing task, which asks them to create an essay based on a short prompt. The inclusion of this Integrated task type was a response to criticisms of the Independent writing task (e.g., Purves, 1992; Raimes, 1990) and the call for writing tasks that more closely emulate how students are required to perform at the university level (Cumming, Kantor, Powers, Santos, & Taylor, 2000). Thus, the Integrated writing task was designed to resemble those that students will likely encounter in actual academic settings. With its emphasis on integrated skills, the TOEFL iBT is said to be able to provide “better information to institutions about students’ ability to communicate in an academic setting and their readiness for academic coursework” (ETS, 2007, para. 3). Only one study (Riazi, 2016), however, has examined whether the performance elicited by the Integrated writing task actually resembles performance in academic settings.
Meanwhile, in most universities, students’ academic success depends largely on their ability to perform well on various academic writing tasks, because most college and university level courses require term papers and other forms of academic writing as evidence of students’ understanding and mastery of course materials. Educational institutions that rely on TOEFL scores to make admissions decisions are therefore expected to benefit if students’ writing performance on the TOEFL iBT is comparable to how they would perform in writing tasks in university settings.
Addressing this comparability between performance on the TOEFL and the target language use (TLU) domain (Bachman & Palmer, 2010) would provide support for the extrapolation inference in the TOEFL validity argument, which requires support for the statement that “the construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education” (Chapelle, Enright, & Jamieson, 2008, p. 21). Establishing extrapolation is critical to supporting the claim that an assessment is useful for decision-making. Test users want to know whether a test score really provides information about a student’s performance beyond the test situation. Two approaches have been used to support the extrapolation inference: (1) examining the relationship between TOEFL iBT scores and other indicators intended to reflect students’ performance in the TLU domain; and (2) comparing the quality of the speaking and/or writing performance elicited by TOEFL iBT with that elicited by academic tasks at universities. This study addresses the extrapolation inference in the TOEFL validity argument by investigating the comparability of non-native-English-speaking undergraduate students’ performance on TOEFL iBT writing tasks and on writing tasks assigned in required writing courses in US universities using both approaches. First, the study examines the relationship between students’ TOEFL iBT scores (on the Independent task, the Integrated task, and the overall Writing section) and several indicators of writing performance in their required writing class. Second, the study compares the quality of the writing produced by these students in response to the two TOEFL writing tasks and course assignments using scores on five dimensions of writing quality: grammatical, cohesive, rhetorical, sociopragmatic, and content control.
Comparability of performance on TOEFL iBT and in university settings
Studies that have examined the comparability of performance on TOEFL iBT and university tasks have typically investigated the relationship between test scores and scores obtained on university tasks or task simulations. For example, Sawaki and Nissan (2009) examined the relationship between scores on the TOEFL iBT Listening section and student performance on three complex academic listening tasks. Xi (2008) examined the relationship between scores on the TOEFL iBT Speaking section and scores on local tests used at universities to determine whether candidates’ English was sufficient to teach as International Teaching Assistants. More recently, Ockey, Koyama, Setoguchi, and Sun (2015) investigated the relationship between scores on the TOEFL iBT Speaking section and scores on speaking assessments used at a Japanese university. Focusing on writing, Weigle (2010) examined the relationship between scores on the Independent task (those produced via automated scoring and human raters) and various indicators of students’ academic writing performance, including students’ self-assessment, instructor ratings, and independent ratings of their writing in class.
Another approach to investigate comparability is to examine the characteristics of the language elicited by the TOEFL iBT tasks and the characteristics of the language elicited by tasks in the university setting. For example, Brooks and Swain (2014) compared the grammatical, discourse, and textual features of the language students produced in response to the TOEFL iBT Speaking tasks and the language produced in one in-class and one out-of-class speaking activity. Weigle and Friginal (2015) examined the textual features of the writing produced in response to the TOEFL iBT Independent task and those in a corpus of successful college writing across several disciplines using Biber’s multidimensional analysis. Similarly, Riazi (2016) used Coh-Metrix to compare the textual features of the writing produced by graduate students in response to the TOEFL iBT Independent and Integrated tasks and university academic tasks.
Comparability studies focused on writing
The present study builds on and expands three comparability studies that have focused on writing. First, Weigle (2010) examined comparability of scores on the TOEFL iBT Independent essays and those on several indicators of writing proficiency in the university setting in a study designed to investigate the use of automated scoring of TOEFL iBT essays. Three hundred and eighty-six undergraduate and graduate international students at eight universities completed a survey and two Independent tasks on two topics. Students also provided two samples of writing for courses they were enrolled in within their major or from their English composition or ESL courses. The writing samples were scored using a rubric designed for the study that included two broad dimensions: content and language. Participating students’ instructors were also invited to complete a survey about their students’ writing ability. Weigle (2010) found the highest correlations of the TOEFL writing scores with instructors’ assessments, followed by student self-assessments, and then the content and language ratings on the class writing samples. She observed that “the highest correlations of essay scores tended to be with overall measures of global language proficiency rather than specific aspects of writing ability” (p. 349) and thus concluded that “one possible interpretation of these results is that the TOEFL iBT Independent task may be tapping general language proficiency somewhat more than academic writing ability, constructs that are generally considered related but distinct” (p. 349).
Weigle’s study is informative about the Independent writing task, but the Writing section score on the TOEFL iBT takes into account students’ performance on both the Independent and the Integrated task. In addition, the participants in the study included both graduate and undergraduate students enrolled in very different types of courses. The kind of writing expected in these courses ranged from highly technical, disciplinary graduate-level writing to more generic, undergraduate writing for ESL classes. In addition, the writing samples submitted were the result of varying levels of assistance and feedback, so “they are not all strictly products of the individuals who turned them in” (p. 349).
In a second study, Weigle and Friginal (2015) examined linguistic features in writing to compare performance on successful TOEFL Independent essays and performance in a corpus of successful student writing in the university setting. Using Biber’s (1988) multidimensional analysis, TOEFL Independent essays written by native and non-native speakers on two different test prompts were analyzed along four dimensions considered to be characteristic of successful student writing. An example of a dimension they analyzed is the extent to which an essay included “expressions of opinions, attitudes, emotions, and mental processes” (p. 27). Results demonstrated that the writing produced in response to the TOEFL Independent task differed in significant ways from those in disciplinary writing, particularly in the natural and health sciences. Weigle and Friginal (2015) found that “the test essays received higher dimension scores, in some cases quite a bit higher, suggesting that essay prompts such as those on the TOEFL [Independent task] elicit writing that differs in substantive ways from disciplinary writing” (p. 34). Their study, however, only examined the linguistic features of essays produced in response to the Independent task. The authors admit that results may differ if the same analysis were conducted with essays produced in response to the Integrated task. Also, their study did not compare the writing of the same students across the two contexts (TOEFL iBT and university), but rather compared student performance on the Independent task to an existing corpus of student writing.
Like Weigle and Friginal’s (2015) research, a third study by Riazi (2016) focused on comparing characteristics of the writing produced in response to TOEFL writing prompts with characteristics of writing assignments in the university setting. Using Coh-Metrix, he compared the textual features (syntactic complexity, lexical sophistication, and cohesion) of the essays produced by 20 international graduate students in response to the TOEFL iBT Independent and Integrated tasks and to their academic writing assignments. Although he found some differences between the textual features of the Independent and Integrated essays, he generally found a high level of congruence between the textual features of the TOEFL iBT tasks and those of the academic assignments, suggesting that “the combination of the two test tasks have the capability of eliciting similar linguistic and discoursal features as do the academic assignments” (p. 25). Riazi (2016) makes an important contribution as the only study focused on comparability of performance so far that also included the Integrated task. The study also compared the same students’ writing across the two TOEFL iBT and university tasks. The study, however, focused on a relatively small sample of graduate students and examined comparability in terms of discrete linguistic features in the writing.
Unlike some of the prior studies that gathered data from a broad range of students enrolled in a variety of different courses related to their majors, in the present study we aimed to control for some of this variability by focusing on a specific student population, consisting of undergraduate students in their first (or second) year of college, and a specific TLU domain, that is, the required writing courses. Because such courses constitute a graduation requirement for students in many North American universities, they represent an important part of the TLU domain.
In addition, the studies reviewed above did not specify whether the papers collected were first drafts or final drafts, thus making the findings difficult to interpret. A unique feature of this study is that we collected both first and final drafts of course assignments to be able to distinguish what students are able to do on their own from what they can do after one or more rounds of feedback and revision. We are thus able to better distinguish students’ initial writing performance from their writing performance after instruction. Since the purpose of TOEFL iBT is to determine students’ readiness for study in the university setting, comparing students’ performance on the TOEFL writing tasks to their initial writing in course assignments yields more appropriate data to investigate the extrapolation inference. In other words, how students perform on TOEFL writing tasks should provide information about how they will perform in the university setting on their own, prior to receiving feedback and instruction.
Research questions
The purpose of this study was to compare students’ writing performance on TOEFL iBT and in required composition courses. Specifically, we addressed the following questions:
1. To what extent are scores on the TOEFL iBT Independent task, Integrated task, and Writing section related to the following TLU indicators: (a) writing instructor ratings of students’ writing and of overall English proficiency; (b) instructor-assigned grades on course assignments in required writing courses; and (c) students’ grammatical, cohesive, rhetorical, sociopragmatic, and content control in the first and final drafts of course assignments in required writing courses?
2. To what extent is the quality of student writing on the TOEFL iBT Independent and Integrated tasks and in first and final drafts of course assignments in required writing courses comparable when scored using a common analytic rubric?
Method
Two approaches were used to address the research questions. First, we used correlational analyses to examine the relationships between TLU indicators of writing performance and TOEFL scores (the Independent task, the Integrated task, and the Writing section). Second, we compared the quality of the writing produced in response to the Independent and the Integrated tasks to the quality of the writing produced by the same test takers on the first and final drafts of course assignments. We scored all of the writing samples using a common analytic rubric and compared scores along the dimensions in the rubric using paired sample t-tests.
Participants
Students
Participants included 103 international undergraduate students who were non-native speakers of English enrolled in required writing classes at eight universities in the United States, together with their instructors. Eighty-two percent (84 out of 103) of students were enrolled in required writing classes designated for ESL/International students, and 18% (19 out of 103) were enrolled in mainstream required writing classes. Both types of courses fulfilled a graduation requirement. Students enrolled in developmental writing classes were excluded from the study. Student participants were in their first (82%) or second (18%) year of study at the university and thus had not yet specialized in any particular field. Nonetheless, all but 14 had selected a major: 14 students in the Humanities and Arts, 18 in the Social Sciences, 22 in STEM, and 35 in Business. Participants identified 25 different countries of origin; half indicated that they were born in China. On average, participants had been living in the United States for 1.2 years (median .7 years) and had been studying English for 6.4 years (median 7 years). Seventy percent of the student participants identified as female and 30% identified as male.
Students in the sample represented a wide range of levels of English language proficiency (see Table 1). Of the 99 students who reported taking the TOEFL (four reported taking the IELTS), 75 reported taking the TOEFL iBT and 24 reported taking the Paper-and-Pencil TOEFL. Five students did not report TOEFL scores. Among those who took the TOEFL iBT, 74 reported their overall score and 58 also reported their section scores. Their mean score was 87 with a standard deviation of 20 points, slightly higher than the mean score of 80 and standard deviation of 21 of all examinees applying for admission to colleges or universities (ETS, 2017). Among those who reported taking the Paper-and-Pencil TOEFL, the mean score was 523 with a standard deviation of 36 points. A score of 523 on the Paper-and-Pencil TOEFL is the equivalent of a 69–70 on the TOEFL iBT (ETS, 2005). Thus, students who reported Paper-and-Pencil TOEFL scores had, on average, a lower level of English proficiency than those who reported TOEFL iBT scores.
Table 1. Students’ reported TOEFL scores (n = 94).
Writing instructors
All participating students’ writing instructors were invited to participate in an online questionnaire. Eighteen instructors from seven universities completed the questionnaire, representing instructors of 84% of the participating students. Twelve instructors taught a course specifically designed for ESL or international students, while six instructors taught a mainstream writing class. With one exception, all instructors were born in the United States, and 16 of the 18 instructors identified as native speakers of English. Fourteen also reported having some level of proficiency in a language other than English. Instructors also reported their highest level of education: seven had earned doctoral degrees and 11 had earned Master’s degrees in composition, English, or TESOL/TEFL. Instructors had varying levels of experience: three had taught for more than 25 years, four for 11 to 15 years, four for six to 10 years, four for two to five years and three for one year or less.
Procedures
Students participated in a two-hour data collection session either individually or in groups at a computer lab at their institution. During the session, each student participant completed an online background questionnaire, two TOEFL iBT writing tasks, and an online perception questionnaire. During the data collection sessions, students also submitted two graded academic writing assignments that they had completed for their required writing course. The instructors of all participating students received an email invitation to complete an online instructor questionnaire. All participating students and instructors received a stipend for their participation.
The TOEFL writing samples were scored in two ways: (1) they were scored at ETS using the TOEFL writing rubrics; and (2) they were scored by a team of raters at the first author’s institution using a common analytic rubric, at the same time as the two course assignments.
Instruments and materials
Student background questionnaire
The background questionnaire asked students to provide basic background information, including nationality, L1 background, field of study, English writing instruction received, TOEFL scores submitted for admission, and self-assessment of their academic writing abilities.
TOEFL iBT tasks
Two Independent and two Integrated TOEFL iBT writing tasks provided by ETS were used in the study to create two forms of the Writing section (Form A and Form B). These forms are retired forms of the operational test used for research purposes that are not publicly available. Student participants were randomly assigned to take Form A or Form B. Students completed both tasks within a total of 55 minutes (25 minutes for the Integrated task and 30 minutes for the Independent task). This design was used to reduce the chance of a potential influence of one idiosyncratic writing topic on test takers’ performance.
The TOEFL Independent task prompt asks students to agree or disagree with a statement and to provide specific reasons and examples to support their answer. In the TOEFL Integrated task, students listen to a lecture and then read a passage on the same topic. Students are then asked to write an essay that summarizes the lecture and compares the information in the lecture to the information in the reading passage.
Course assignments
Student participants submitted two written assignments completed for their required writing class. The assignments included the prompt, all drafts and the instructor’s comments (if any), and the letter grade or numeric score assigned by the instructor to the final draft. For the purpose of this study, we scored the first and the final draft of each of the two course assignments. In classes where students completed in-class writing assignments, we collected the in-class writing as one of their two course assignments.
Table 2 summarizes the characteristics of the course assignments based on the prompts. Most of the course assignments asked students to write essays (72%), followed by book reports (11%), and letters (5%). Eighty-five percent of the course assignments elicited an argument in which students had to state a claim and support it with evidence. The majority of course assignments required the use of source text (47%) or source text and personal experience (32%). Finally, the majority of the course assignments were between two and four pages in length. Thus the course assignments in the required writing courses represented what Weigle (2010) refers to as “generic, undergraduate writing” as opposed to technical or disciplinary writing. Table 3 summarizes the types of writing we collected.
Table 2. Summary of characteristics of course assignments (n = 303).
Note: Course assignments that could not be classified under any of the traditional categories (e.g., argument, narrative) were classified as non-traditional. For more details about the course assignments, see Grapin and Llosa (2017).
Table 3. Writing samples collected.
As Table 3 shows, all participants completed two TOEFL iBT writing tasks, one Independent and one Integrated. We also collected 303 course assignments in total. Not all students submitted both the first and final drafts for the two course assignments; some students explained that when revising, they saved over the first draft. Others had lost or failed to retain a copy of the first and/or final draft. We were able to collect in-class writings from 38 students whose classes included an in-class midterm or final exam. As will be explained later, these in-class writings were judged to be comparable to first drafts.
Analytic rubric
We used an analytic rubric originally developed by di Gennaro (2013) specifically for college writers. The rubric represents a sophisticated operationalization of the writing construct and includes five dimensions rated on a scale of 0–5: grammatical control, cohesive control, rhetorical control, sociopragmatic control, and content control (see Appendix A). We made minor modifications to the original rubric to eliminate redundancy in descriptors and further clarify the dimensions to facilitate rating.
Instructor questionnaire
Instructors of participating students completed an online questionnaire that asked about their perceptions of the TOEFL iBT writing tasks and rubrics as well as their criteria for assessing student writing (see Grapin & Llosa, 2017; Llosa & Malone, 2017). As part of the questionnaire, instructors were asked to address the following questions: “How would you rate Student X’s academic English language proficiency in general?” and “How would you rate Student X’s academic English language ability by skill?” on a scale of 1–4 (1 = low; 2 = intermediate; 3 = advanced; 4 = almost native).
Rating
During a four-day rating session, a team of five raters scored the TOEFL iBT essays and course assignments using the analytic rubric. The raters included the first author and four doctoral candidates in applied linguistics/TESOL/literacy programs, each with several years of teaching experience. On the first day, raters participated in an extended training session during which they became familiar with the scoring rubric, analyzed anchor papers, and completed several practice sessions. Raters alternated between scoring course assignments (draft 1 and the final draft) and TOEFL iBT essays (Independent and Integrated) in order to ensure a consistent application of the rubric criteria to the different types of papers. Each writing sample was assessed by two raters. The final dimension score was calculated by averaging the scores of the two raters. If the scores assigned by the two raters on any given dimension were more than one point apart, the paper was given to a third rater. If the score of the third rater matched the score of one of the two raters, that score was considered final. If the score of the third rater was different from the scores assigned by the other two raters, we averaged the score of the third rater with the score that was within a point. If the score of the third rater fell exactly between the scores of the two raters, the score of the third rater was considered final. For the purpose of analyses, we used the scores on the individual dimensions (0–5) as well as a total score computed by adding the five dimension scores (0–25). Table 4 reports interrater reliability estimates in terms of percentage of exact agreement, percentage of exact and adjacent agreement, and quadratic weighted kappa. The table also presents Cronbach’s alpha, which was the reliability estimate used to compute disattenuated correlations.
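The score-resolution rules described above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the scoring script used in the study: the function names are hypothetical, and the fallback of averaging the third rating with the closer of the two original ratings is an assumption about what "the score that was within a point" means in practice.

```python
def resolve_dimension_score(rater1, rater2, rater3=None):
    """Return the final score for one rubric dimension (0-5 scale).

    Illustrative sketch of the adjudication rules described in the text;
    names and the 'closer rating' fallback are assumptions.
    """
    if abs(rater1 - rater2) <= 1:
        # Exact or adjacent agreement: average the two ratings.
        return (rater1 + rater2) / 2
    if rater3 is None:
        raise ValueError("Ratings differ by more than one point; a third rating is needed.")
    if rater3 in (rater1, rater2):
        # The third rater matches one of the original raters: that score is final.
        return float(rater3)
    if 2 * rater3 == rater1 + rater2:
        # The third rater falls exactly between the two: the third score is final.
        return float(rater3)
    # Otherwise, average the third rating with the closer original rating
    # (assumed to be the one "within a point").
    closer = min((rater1, rater2), key=lambda score: abs(score - rater3))
    return (rater3 + closer) / 2


def total_score(dimension_scores):
    """Sum the five dimension scores into a total on the 0-25 scale."""
    if len(dimension_scores) != 5:
        raise ValueError("Expected exactly five dimension scores.")
    return sum(dimension_scores)
```

For example, ratings of 2 and 4 with a third rating of 3 would resolve to 3, while ratings of 1 and 4 with a third rating of 2 would resolve to 1.5.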
Table 4. % Exact agreement, % exact & adjacent agreement, weighted kappa and alpha for analytic scores.
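As a rough illustration of how the agreement indices named in Table 4 are usually computed, the sketch below calculates exact agreement, exact-plus-adjacent agreement, and quadratic weighted kappa for two raters. It assumes integer ratings on the 0–5 scale; the function names are hypothetical, and the code is not the analysis script used in the study.

```python
def agreement_rates(ratings1, ratings2):
    """Return (exact, exact_or_adjacent) agreement as proportions."""
    pairs = list(zip(ratings1, ratings2))
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n
    exact_or_adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / n
    return exact, exact_or_adjacent


def quadratic_weighted_kappa(ratings1, ratings2, n_categories=6):
    """Quadratic weighted kappa for integer ratings in 0..n_categories-1."""
    k = n_categories
    n = len(ratings1)
    # Observed cross-tabulation of the two raters' scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings1, ratings2):
        observed[a][b] += 1
    hist1 = [sum(row) for row in observed]
    hist2 = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist1[i] * hist2[j] / n
    if weighted_expected == 0:
        raise ValueError("Kappa is undefined when all ratings fall in one category.")
    return 1 - weighted_observed / weighted_expected
```

Quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which suits an ordinal rubric scale.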
The TOEFL Independent and Integrated essays were also scored at ETS by two trained raters using the corresponding TOEFL iBT holistic rubrics on a scale of 0–5 (see www.ets.org/Media/Tests/TOEFL/pdf/Writing_Rubrics.pdf). ETS provided scores for each of the two tasks as well as for the overall Writing section, which is the combined score of the two tasks transformed into a scale score out of 30.
Data analysis
Preliminary analyses
We conducted several preliminary analyses to determine how to best analyze the data collected. Detailed results of these analyses are presented in Appendices B and C. First, using independent sample t-tests, we compared student performance on Forms A and B of the TOEFL iBT tasks and found no differences, indicating that we could analyze the TOEFL iBT forms together. Next, we compared student performances on the first draft to performances on the final draft of each of the two course assignments along the five dimensions and the total score using paired sample t-tests and found that there were significant differences. The scores on the final draft of the two course assignments were higher than the scores on the first drafts, suggesting (to the relief of writing instructors) that student writing improves as a result of the writing process. We then examined whether the scores on Assignment 1 and Assignment 2 were different to determine whether we had to analyze them separately or together. We found no differences between the first drafts of Assignments 1 and 2; we also found no differences between the final drafts of Assignments 1 and 2. Thus, we decided that we could average the scores across the first drafts for Assignment 1 and Assignment 2 and across the final drafts for Assignment 1 and Assignment 2. As a result, each student had one score per dimension for the combined first drafts and one score per dimension for the combined final drafts. For students who submitted an in-class writing as one of their papers, we found no differences between their scores on the in-class writing and their scores on the first draft of the other paper they submitted. Thus, we categorized the in-class writings as first drafts.
We also converted instructor-assigned grades on the two course assignments to a common scale out of 100. Several of the papers were already graded on that scale; others were graded on a scale out of 20 and subsequently converted to a scale out of 100; some papers had been assigned a letter grade. We converted those letter grades as follows: A = 95, A− = 91, B+ = 88, B = 85, B− = 81, C+ = 78, C = 75, C− = 71. The scores for the two papers were then averaged so that each student had one instructor-assigned grade on the combined course assignments.
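The grade conversion can be sketched as follows. The letter-to-number mapping is taken directly from the text; the linear rescaling of numeric grades (for example, from a 20-point scale) is an assumption, since the text does not state how those grades were converted, and the function names are hypothetical.

```python
# Letter-grade equivalents as listed in the text.
LETTER_TO_PERCENT = {
    "A": 95, "A-": 91,
    "B+": 88, "B": 85, "B-": 81,
    "C+": 78, "C": 75, "C-": 71,
}


def to_percent(grade, scale=None):
    """Convert a letter grade or a numeric grade to a 100-point scale.

    `scale` is the maximum of the original numeric scale (e.g., 20 or 100).
    Linear rescaling of numeric grades is an assumption of this sketch.
    """
    if isinstance(grade, str):
        # Normalize the typographic minus sign to an ASCII hyphen.
        key = grade.strip().replace("\u2212", "-")
        return float(LETTER_TO_PERCENT[key])
    if scale is None:
        raise ValueError("A numeric grade needs the maximum of its original scale.")
    return 100.0 * grade / scale


def combined_grade(percent_grades):
    """Average the converted grades for a student's course assignments."""
    return sum(percent_grades) / len(percent_grades)
```

A student with a B+ on one paper and 17 out of 20 on the other would, under these assumptions, receive a combined grade of 86.5.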
Main analyses
In order to address research question 1 about the relationship between TOEFL iBT scores and TLU indicators of writing performance, we computed Pearson correlations. We examined correlations between three TOEFL iBT scores (the score on the Independent task, the score on the Integrated task, and the score on the overall Writing section) and three indicators of TLU writing: (1) instructor ratings of students’ overall English and writing proficiency; (2) the instructor-assigned grade on course assignments; and (3) five dimensions of student performance on the first and final drafts of course assignments. For (3) we computed both observed and disattenuated correlations. The latter were computed because measurement error in ratings attenuates observed correlations, making them lower than the true correlations among the abilities of interest. Thus, we corrected correlations for attenuation to provide a better estimate of the correlations among the abilities under study for all of the correlations based on variables for which we had reliability information (i.e., Cronbach’s alpha).
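A minimal sketch of the two computations follows, using the standard correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The function names are hypothetical, and the default reliability of 1.0 for the second variable (meaning no correction for that variable) is an assumption for cases where only one reliability estimate is available.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def disattenuate(r_observed, reliability_x, reliability_y=1.0):
    """Correct an observed correlation for measurement error.

    Reliabilities would be estimates such as Cronbach's alpha; passing 1.0
    leaves that variable uncorrected (an assumption of this sketch).
    """
    return r_observed / math.sqrt(reliability_x * reliability_y)
```

With hypothetical values, an observed correlation of .40 between two measures that each have a reliability of .80 would correspond to a disattenuated correlation of .50.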
To address research question 2 (comparing student performance on each of the five dimension scores obtained for their performance on the TOEFL iBT tasks and that on the course assignments), we computed paired sample t-tests. We applied the Bonferroni correction and adjusted the significance level to .01 to account for multiple comparisons across the five dimensions (.05/5). We also computed Cohen’s d, the effect size of the difference between mean dimension scores on the TOEFL iBT tasks and course assignments.
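The comparison can be sketched as below. The code computes the Bonferroni-adjusted threshold, the paired t statistic with its degrees of freedom, and Cohen's d. The text does not say which variant of Cohen's d was used; this sketch assumes the version based on the pooled standard deviation of the two sets of scores, and all function names are hypothetical. Obtaining a p value additionally requires the t distribution, which a statistics package such as SciPy provides.

```python
import math
import statistics


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for x versus y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


def cohens_d(x, y):
    """Cohen's d using the pooled SD of the two score sets (assumed variant)."""
    pooled_sd = math.sqrt((statistics.variance(x) + statistics.variance(y)) / 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd
```

With five dimensions, `bonferroni_alpha(0.05, 5)` gives the .01 threshold described above, and each dimension's paired comparison is then evaluated against that threshold.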
Findings
Relationship between TOEFL iBT scores and TLU indicators of student writing performance
Table 5 presents descriptive statistics for TOEFL iBT scores on the Independent task, the Integrated task, and the Writing section. It also contains descriptive statistics for instructor ratings of students’ overall English proficiency and writing proficiency, the instructor-assigned grade on course assignments, student performance on the first and final drafts of course assignments on the five dimensions, and student performance on the Independent and Integrated tasks along the five dimensions. Instructor-assigned grades on course assignments ranged from 71 to 98 with a mean score of 86.7. Instructor ratings of students’ writing and overall proficiency ranged from 1 to 4, with the mean rating on overall proficiency (2.7) being higher than the mean rating on writing proficiency (2.3). Performance on the two TOEFL tasks was comparable both in terms of the TOEFL iBT scores and the analytic scores. As would be expected, mean analytic scores on final drafts of course assignments were higher than those on first drafts.
Table 5. Descriptive statistics.
In order to address research question 1, we examined correlations between TOEFL iBT scores and the TLU indicators. (See Table 6.)
Table 6. Observed (and disattenuated) correlations between TOEFL iBT Writing scores and TLU indicators of writing performance.
*p < .05 **p < .01 ***p < .001.
Note: Disattenuated correlations are in parentheses.
Instructor ratings of proficiency
Correlations between scores on TOEFL iBT (Independent task, Integrated task, and Writing section) and instructor ratings of students’ writing and overall English proficiency were moderate and significant, ranging from .303 to .425.
Instructor-assigned grade on course assignments
The correlation between the scores on the Independent task and the instructor-assigned grade on course assignments was not significant. However, there was a significant but small correlation between the scores on the Integrated task and the instructor-assigned grade (.256) and between the Writing section and the instructor-assigned grade on course assignments (.246).
Performance on first drafts
The scores on the Independent task, the Integrated task, and the Writing section correlated significantly and positively with all dimension scores on the first drafts. The score on the Independent task was most highly correlated with grammatical and cohesive control and had the lowest correlation with rhetorical control. The score on the Integrated task was most highly correlated with cohesive, grammatical, and sociopragmatic control and had the lowest correlation with rhetorical control. Finally, the score on the Writing section was most highly correlated with cohesive and grammatical control and had the lowest correlation with rhetorical control in students’ first drafts.
Performance on final drafts
Correlations between the score on the Independent task, the Integrated task, and the Writing section and all dimensions on the final drafts were also positive and significant but of lower magnitude compared to correlations between TOEFL scores and first draft scores. Scores on the Independent task, the Integrated task, and the Writing section were most highly correlated with cohesive and grammatical control and had the lowest correlation with rhetorical control in students’ final drafts.
The positive and significant correlations between TOEFL iBT scores and most TLU indicators indicate that there is an association between students’ writing performance on the TOEFL iBT and their performance in their required university writing class.
Quality of the writing on TOEFL iBT tasks and course assignments
In order to compare the quality of the writing across the two settings (TOEFL iBT and university writing courses) we conducted paired sample t-tests between the dimension scores on the TOEFL Independent and Integrated tasks and dimension scores on the first and final drafts of course assignments (see Tables 7–10).
Table 7. Quality of the writing on the Independent task vs. first drafts of course assignments.
*p < .01 **p < .001.
Table 8. Quality of the writing on the Independent task vs. final drafts of course assignments.
*p < .01 **p < .001.
Table 9. Quality of the writing on the Integrated task vs. first drafts of course assignments.
*p < .01 **p < .001.
Table 10. Quality of the writing on the Integrated task vs. final drafts of course assignments.
p < .05 *p < .01 **p < .001.
Independent task
As Table 7 shows, there were no differences in students’ dimension scores on the Independent task and the corresponding ones on the first drafts of course assignments, with the exception of rhetorical control where students performed significantly better on the Independent task than on their first drafts. The effect size for rhetorical control was .46, indicating that students performed on the Independent task almost one-half a standard deviation, or about half a point (on the analytic rubric’s five-point scale), above how they performed on the first drafts. This finding indicates that with respect to aspects of writing quality, students’ performance on the Independent task is comparable to their performance on first drafts of course assignments with the exception of rhetorical control.
However, when we compared students’ dimension scores on the Independent task to those on the final drafts of course assignments, we found statistically significant differences (see Table 8). Students’ scores were significantly higher in their final drafts across all dimensions except rhetorical organization. The effect size for the total score (d = .52), for example, indicates that students’ performance on their final drafts was about one-half of a standard deviation above their performance on the Independent task, or about half a point. These findings suggest that performance on the TOEFL Independent task may not be representative of what students can do on their final drafts.
Integrated task
Although there were no differences between the total score on the Integrated task and the total score on the first drafts of course assignments, we found differences across some of the dimensions (see Table 9). Students scored higher on rhetorical (d = .50) and sociopragmatic control (d = .52) on the Integrated task than on first drafts. Although only marginally significant, students scored higher on content control on their first drafts than on the Integrated task. Overall, student performance on the Integrated task does not differ from their performance on first drafts, but there are some differences along some of the dimensions.
As Table 10 shows, dimension scores on the Integrated task and the final drafts differed significantly in total score (d = .30), grammatical control (d = .55), and content control (d = .56), with higher scores on the final drafts of course assignments. There was also a marginally significant difference in cohesive control (d = .21). There were no differences in rhetorical or sociopragmatic control. These findings suggest that performance on the TOEFL Integrated task may not be representative of what students can do on their final drafts, with the exception of rhetorical and sociopragmatic control.
Discussion
The purpose of this paper was to examine the comparability of undergraduate students’ writing performance on TOEFL iBT tasks and their performance in required writing courses using two approaches: (1) by focusing on the actual TOEFL iBT scores on the Independent task, the Integrated task, and the overall Writing section; and (2) by focusing on the quality of the writing of TOEFL iBT essays along five dimensions.
TOEFL iBT scores and performance in university writing courses
TOEFL iBT scores on the Independent task, the Integrated task, and the Writing Section were positively and significantly correlated with most TLU indicators. One strength of the current study is that we included both TOEFL iBT tasks (Independent and Integrated) and the Writing section score. Weigle (2010), the only other study to examine the relationship between scores on TOEFL iBT writing tasks and university writing, focused only on the Independent task. Previous studies have compared the Independent and Integrated tasks to each other in terms of the language they elicit and concluded that they do differ (e.g., Biber & Gray, 2013; Guo, Crossley, & McNamara, 2013). However, studies had not yet examined whether these differences meant that the score on the Integrated task would have different relationships to TLU indicators of writing performance than the score on the Independent task.
In fact, we found that performance on the Integrated task was more closely associated with student performance in the required writing course, as measured by the instructor-assigned grade, than was performance on the Independent task. This provides some support for Weigle (2010), who concluded that the Independent task may be a broader measure of language proficiency rather than of writing and speculated that the Integrated task may better reflect the academic writing that takes place at the university. This finding may be due to unreliability in the instructor-assigned grade, in which case there might be no real difference in the relationship between scores. Alternatively, the significant correlation between the Integrated task and the instructor-assigned grade may be explained by the fact that the majority of the course assignments in this study (79%) required the use of source text (see Table 4), as does the Integrated task. Nonetheless, given the small magnitude of the correlation, additional research is needed to further investigate the relationship between teacher-assigned grades on course assignments and performance on TOEFL tasks.
It is interesting to note that the correlations between TOEFL scores and instructor ratings of students’ writing and overall English proficiency were larger in magnitude than those between TOEFL scores and the instructor-assigned grade on course assignments. This outcome may be in part because the graded course assignments were final drafts that reflected student performance after one or more rounds of feedback and revision. On the other hand, the instructors’ ratings of students were likely based on their knowledge of students’ full range of language and writing ability from first to final draft. In her study, Weigle (2010) found instructor ratings to be the indicator of writing ability most highly correlated with scores on the Independent task.
Correlations between all TOEFL scores and all dimension scores on the first drafts of course assignments were positive and significant; correlations between TOEFL scores and final draft dimension scores were also positive and significant but of lower magnitude. Despite differences between the Integrated and Independent tasks, scores on both tasks and the Writing section were most closely associated with students’ grammatical and cohesive control in their university writing and least associated with their rhetorical control. This finding suggests that grammatical and cohesive control may be language abilities that manifest similarly across writing tasks, whereas rhetorical control may be more affected by differences in task characteristics. Since TOEFL iBT is intended to inform test users about an applicant’s readiness for university coursework, it seems reasonable that TOEFL scores are most associated with the language abilities that manifest consistently across task types.
These findings also suggest that the Integrated task expands the construct of writing assessed by TOEFL iBT so that it is more representative of writing in the TLU domain than the Independent task alone is. Scores on the Integrated task were most related to grammatical and cohesive control in course assignments (as were scores on the Independent task), but the score on the TOEFL Integrated task was significantly related to the instructor-assigned grade on course assignments whereas the score on the Independent task was not. Although more research needs to be conducted on the relationship between scores on the Integrated task and indicators of writing performance, the results of this study thus far support the conclusions of Riazi (2016) and Guo et al. (2013) “that the integrated and the independent writing tasks share construct coverage and, at the same time, tap into different elements of writing, thus justifying the combined use of the two tasks in a single test” (p. 234).
It is important to note at this point the limitations inherent in correlational analyses. Correlations are affected by score variability and measurement reliability and thus may not necessarily reflect a true relationship between two variables. We were able to correct for unreliability of the analytic scores and the TOEFL iBT scores and presented both observed and disattenuated correlations between those variables. However, we were unable to do so for the instructor-assigned course grade or the instructor ratings of students’ writing and overall English proficiency. Unreliability in these variables may have affected the correlations presented. Also, even though the reported English proficiency of the student sample was fairly representative of the TOEFL undergraduate test-taking population, the restricted range in the sample may have attenuated the correlations.
Quality of the writing on TOEFL iBT and course assignments
In addition to examining TOEFL iBT scores, we compared the quality of the writing produced in response to the two TOEFL tasks and course assignments. Instead of focusing on the texts’ linguistic features either using Coh-Metrix (Riazi, 2016) or a corpus approach (Weigle & Friginal, 2015), we compared the quality of the writing by scoring the TOEFL iBT essays and the first and final drafts of the course assignments along five dimensions using an analytic rubric. We found that students’ performances on the Independent and Integrated tasks were comparable to their performance on first drafts of course assignments despite the differences between the task characteristics in the two settings (test vs. university writing course). The only dimension of writing where performance was not comparable to first drafts was rhetorical control. This finding can be explained by the fact that TOEFL iBT essays are significantly shorter and thus relatively easier to organize than the course assignments. In addition, rhetorical control is perhaps the easiest aspect of writing to prepare for in a testing context. In fact, students scored higher on rhetorical control on both the Independent and Integrated tasks than on any other dimension (see Table 7).
Finally, we found that performance on the TOEFL tasks is not representative of what students can do on their final drafts. In general, students’ performance on final drafts was better than their performance on the TOEFL tasks by about half a point on many of the dimensions. Because final drafts by definition have benefitted from review by the writer, as well as feedback from the instructor and/or peers, this finding is logical.
It is worth mentioning that writing courses may be unique in their emphasis on the writing process. In many subject area courses, in which students are not required to revise their written work based on feedback, it is likely that the writing students produce may be similar to students’ first drafts in a writing class. If this is the case, it is possible that the quality of writing in subject area courses will be comparable to students’ writing in TOEFL iBT. On the other hand, disciplinary writing in subject area courses may be very different from the generic writing that students produce in writing courses, and therefore students’ disciplinary first draft writing may not be comparable to their writing on TOEFL iBT. Weigle and Friginal (2015) found that the TOEFL iBT Independent task “elicits a fundamentally different set of lexico-grammatical features than does disciplinary writing, particularly outside the humanities” (p. 36). Future research could replicate Weigle and Friginal’s (2015) analysis using the Integrated task to see whether the relationships found differ when comparing disciplinary writing to performance on the Integrated task. Another future study could compare the performance of non-native-English-speaking undergraduate students in their first or second year of university on both TOEFL iBT writing tasks and tasks assigned in subject area courses; such a study would complement the current study and provide a more detailed picture regarding the comparability of performance on TOEFL iBT and writing across a range of university courses.
Conclusion
By focusing on undergraduate students in their first year of university, analyzing both first and final drafts of course assignments and selecting participants from required writing courses, this study was able to isolate as much as possible students’ writing ability from post-instruction performance. Because scores on first and final drafts were significantly different, as were relationships of these scores to TOEFL scores, this distinction is important. Therefore, studies that examine comparability of performance need to attend to the nature of the writing (with or without review and feedback) being compared to TOEFL writing. Had we only examined final drafts in our study, we would have found correlations with TOEFL iBT scores of smaller magnitude, as well as differences in the quality of the writing across TOEFL tasks and course assignments. Such an approach would have resulted in comparing performance on TOEFL to performance resulting from students’ ability enhanced by instruction, feedback and revision. Although it is important to understand that relationship, demonstrating that the abilities engaged by the TOEFL tasks are similar to those engaged by university writing courses and that student performance on the TOEFL is an indication of how students will perform initially is of critical importance. At the same time, this study sheds light on why measures such as GPA (measures of achievement post instruction) may not be appropriate for assessing the predictive validity of a language test, such as TOEFL iBT.
In conclusion, performance on the TOEFL Writing section was found to be at least somewhat associated with all dimensions of writing quality in academic writing tasks in a required writing course, but it was most strongly associated with students’ grammatical and cohesive control in their writing and with the writing they can do in first drafts. This finding supports the use of the TOEFL Writing section for decisions about whether test takers are prepared for university writing courses. The results of this study provide backing to support the TOEFL extrapolation inference, suggesting that “the construct of academic language proficiency [academic writing proficiency, in this case] as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education” (Chapelle et al., 2008, p. 21).
Appendix A
Analytic Scoring Rubric (di Gennaro, 2013).
Appendix B
Appendix C
Acknowledgements
We would like to thank the many people who have contributed to this study: Anne Donovan, Cecilia Guanfang Zhao, Jing Wei, Lillian Stevens, Christopher Van Booven, Scott Grapin, and all the participating students and instructors.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Educational Testing Service (ETS) under a Committee of Examiners and the Test of English as a Foreign Language research grant. ETS does not discount or endorse the methodology, results, implications, or opinions presented by the researcher(s).
