Abstract
This study examines the predictive validity of the TOEFL iBT with respect to academic achievement as measured by the first-year grade point average (GPA) of Chinese students at Purdue University, a large, public, Research I institution in Indiana, USA. Correlations between GPA and TOEFL iBT total and subsection scores were examined for 1990 mainland Chinese students enrolled across three academic years (N2011 = 740, N2012 = 554, N2013 = 696). Subsequently, cluster analyses on the three cohorts’ TOEFL subsection scores were conducted to determine whether different score profiles might help explain the correlational patterns found between TOEFL subscale scores and GPA across the three student cohorts. For the 2011 and 2012 cohorts, speaking and writing subscale scores were positively correlated with GPA; however, negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, the writing, reading, and total scores were positively correlated with GPA, and the negative correlations disappeared. Results of cluster analyses suggest that the negative correlations in the 2011 and 2012 cohorts were associated with a distinctive Reading/Listening versus Speaking/Writing discrepant score profile of a single Chinese subgroup. In 2013, this subgroup disappeared from the incoming class because of changes made to the University’s international undergraduate admissions policy. The uneven score profile has important implications for admissions policy, the provision of English language support, and broader effects on academic achievement.
Introduction: Demographic shifts and admissions
Purdue University, a large (39,409 students), public, R1 institution of higher education in the state of Indiana known for its programs in STEM (Science, Engineering, Technology, and Mathematics), has long ranked among the top-three US public universities in international student enrollment. Until very recently, these rankings could primarily be attributed to its graduate students. However, from 2005 to 2015, Purdue’s international undergraduate population more than doubled, from 1894 to 5233 (+176%), and in 2010, international undergraduate students outnumbered their graduate counterparts for the first time. In the 2015–2016 academic year, the Office of International Students and Scholars (Purdue ISS, 2015) reported that international students comprised 40% of the graduate student body (3997/8974) and 18% of the undergraduate student body (5233/29,497) – for a combined international student population making up 23% (9230/38,471) of the student body.
The substantial increase in the number of international undergraduates on North American university campuses is not unique to Purdue (Fischer, 2014). This trend has been attractive to university administrators as a way to increase revenue, compensate for declining domestic enrollments, and increase campus internationalization. Because international undergraduate students pay out-of-state tuition, boosting their enrollment provides a much-needed revenue stream, especially for public institutions in North America where state funding has steadily decreased over the past decade. Increasing international admissions has also allowed many universities to maintain, or even increase, the size of their undergraduate student bodies despite a diminished pool of in-state applicants caused by declining birth rates. An associated benefit at Purdue, although it may not hold elsewhere, is that the increased admission of international undergraduates has raised the university’s academic profile. On average, Purdue’s international undergraduates enter with higher SAT total scores (+100 points), graduate at a slightly higher four-year rate (51% vs. 49%), and obtain slightly higher GPAs than their domestic counterparts from the state of Indiana. Finally, increasing international undergraduate enrollment is intended to enhance campus internationalization, which, as has been argued, may benefit domestic undergraduates, who will compete for jobs in an increasingly international and globalized job market.
Despite these benefits, many universities have struggled to address the needs of incoming international undergraduate students, and this struggle has prompted widespread reassessment of university policies, programs, and services. Comparable shifts in student demographics have triggered policy and program reassessments in the past. For example, when American soldiers returning from World War II enrolled at North American universities in record numbers, programs were introduced to ameliorate their perceived lack of academic preparation and ease their transition back into civilian life (Bound & Turner, 2002; Clark, 1998). A similar focus on preparation accompanied the increase in international graduate admissions on North American campuses in the 1980s (Bailey, 1984; Ginther, 2003). In defense of this increase, faculty and administrators argued that the rapidly expanding, but controversial, international graduate student presence was “vital to certain institutions and whole fields of academic study” (Goodwin & Nacht, 1983).
Then, as now, many international graduate students were funded largely through teaching assistantships, which require them to teach undergraduate-level courses in their major areas. Many (including faculty, administrators, and domestic undergraduates) questioned whether meeting minimum cut score admission requirements for English language proficiency indicated that international graduate students could comfortably teach introductory university-level courses in English. In order to ensure that prospective teaching assistants have actually obtained the appropriate level of proficiency, English language programs that provide post-entry proficiency screening and training for international teaching assistants (ITAs) are now customary at large public institutions that offer graduate teaching assistantships. These programs are designed not only to address the gaps that may exist between entry-level English language proficiency and the level needed to provide discipline-specific content but also to introduce the expectations and practices that characterize North American teaching contexts in university settings.
The increased admissions of international undergraduate students have led to comparable concerns with entry-level English language proficiency (Fischer, 2011, 2013; Redden, 2012), and in our case, the use of entry-level language-proficiency cut scores. Until 2012, Purdue’s undergraduate admissions officers screened applicants only on the language proficiency total scores (TOEFL iBT 80; IELTS overall 7.0) submitted in their application files. However, there was a misunderstanding among many faculty about the proficiency level represented by those scores. Many faculty believed that the total cut score admissions requirement (often referred to as “passing” the TOEFL or IELTS) represented a level of English language proficiency comparable to that of entering domestic first-language (L1) English speakers; that is, faculty assumed that meeting the minimum total cut score indicated readiness to participate fully in all instructional activities on the first day of class without the need for English language support or accommodation. When faculty noticed that such assumptions did not hold, they tended to question the validity of the tests.
In a study examining faculty use and interpretation of English language proficiency test scores for graduate admission purposes at Purdue University and the University of Melbourne (Ginther & Elder, 2014), a faculty member commented, “I think all the tests are problematic. We get candidates who have passed whose English is barely functional and others whose English is very good.” Another remarked, “I have too many students who have passed the TOEFL but simply cannot communicate in English.”
Similar concerns have been expressed about the English language proficiency of incoming international undergraduates. A department head recently wrote, “some of the faculty are upset that our [undergraduate] Chinese students simply do not have the language skills to function in class. We have discussed this issue several times and a few years ago made our concerns known to the Provost’s Office. We were told that these students met minimum Purdue language requirements” (personal communication, October 8, 2015). Another faculty member, who had redesigned the introductory statistics curriculum, wrote, “a year or two ago, I was almost in crisis mode with my hybrid class because I couldn’t get the Chinese students to follow directions or talk to the American students, even when I assigned them to mixed groups” (personal communication, January 26, 2016).
Largely unaware that the minimum TOEFL cut score represents the low end of a fairly wide range of proficiency levels (80–120), faculty can be dismayed when they notice that some of their students experience difficulty with English and infer that the entire cohort is problematic. It may also be the case that faculty are less likely to notice their more proficient peers. On the other hand, a minimum total cut score of 80, while commonly used by many large public universities in the States, corresponds only to the 48th percentile for undergraduate TOEFL iBT examinees (ETS, 2015), which means admissions officers at Purdue could probably set higher cut scores without endangering international admission targets. In either case, the perceived mismatch between actual in-class performance and minimum cut score requirements has engendered three common interpretations among faculty: (1) the TOEFL may not be a reliable or valid instrument for measuring English language proficiency; (2) the minimum has been set too low; and/or (3) applicants must be finding ways to cheat on language proficiency exams.
In 2012, the Office of Enrollment Management at Purdue conducted a study that examined the relationship between international students’ TOEFL iBT total scores and first-year GPA. Predictive validity studies often begin with correlational analyses, but results often fail to exert any real influence on admissions policy or practice. In this case, the potential influence on admissions policy was considerable. Undergraduate admissions officers at Purdue and elsewhere are under pressure to meet in-state, out-of-state, international, under-represented minority, and first-generation enrollment goals from a pool of applications that has grown to more than 45,000 a year (17,000 of these are international). Therefore, securing the “best” class of incoming students entails a process that requires a considerable investment of resources. If a particular variable – for example, English language proficiency – were eliminated or given less weight, the selection process could become less complicated, more efficient, and less expensive.
Based on all incoming students’ scores from the 2011 cohort, in which students with different L1s were combined, the Office of Enrollment Management reported that the correlation between TOEFL iBT total score and first-year GPA was 0. We were asked to confirm their findings; however, in our re-analysis, we examined only Chinese students, included both total and subscale scores, and corrected for restriction of range. Range restriction as a result of admissions or employment selection is often referred to as direct range restriction. The correction for direct range restriction is crucial to the examination of predictive validity evidence for language tests because international students admitted to a university must meet a certain cut score on English language proficiency tests such as TOEFL; therefore, the admitted students only represent a subsample of the entire TOEFL test taker population and a restricted range of TOEFL scores. When we examine the extent to which language proficiency can predict academic success, often operationalized as the correlation between TOEFL and GPA, data on GPA are only available for the admitted students; therefore, the relationship tends to be underestimated (see Thorndike, 1949 for a demonstration of the impact of range restriction). In addition, we extended the analyses to the 2012 and 2013 Chinese cohorts in order to obtain a fuller picture of the relationship between TOEFL iBT and GPA across three academic years (i.e., in three different samples).
Literature review
The relationship between entry-level English language proficiency and first-year GPA
The relationship between language proficiency and academic success is predominantly operationalized in terms of correlation coefficients between language test scores and (first-year) GPA in subsequent coursework. However, there is not a clear-cut answer as to the interpretation of correlation coefficients, or what constitutes a strong or weak correlation. A commonly used interpretation guideline was proposed by Cohen (1988, 1992), where the cutoffs for small, medium, and large effects in terms of Pearson r are .1, .3, and .5, respectively. However, in sciences and social sciences, different guidelines have been proposed for the interpretation of correlation coefficients or effect sizes in general (see, e.g., Plonsky & Oswald, 2014, for a more discipline-specific guideline).
In predictive validity studies for language tests, the interpretation of correlation coefficients is especially difficult. This is largely because the predicted variable (e.g., academic success) is often dependent upon a wide array of factors. For any given single predictor variable (e.g., language test score), a correlation coefficient close to 1 (e.g., .75 or above) is unlikely to be observed. Instead, many scholars argue that even a small correlation can indicate a meaningful relationship (Cho & Bridgeman, 2012; Rosenthal & Rubin, 1982; Sackett, Borneman, & Connelly, 2008). In these cases, the strength of correlations should be interpreted relative to the expected magnitude. That is, if a weak association is expected (e.g., a construct is known to be influenced by a variety of factors, and only one of these factors is correlated with the construct), then an observed correlation coefficient of .3 can be considered unexpectedly strong and meaningful as it can account for 9% of the variance. Similarly, when interpreting the relationship between TOEFL and GPA, a correlation coefficient of close to 1 (e.g., > .75) should not be expected or used as the basis for defining a strong relationship.
To date, mixed results have been reported in studies examining the relationship between language proficiency as measured by entry-level language proficiency test scores and the most common, traditional measure of student success, GPA. A range of correlations, from moderately strong (.50, Al-Musawi & Al-Ansari, 1999) to relatively weak (.14, Light, Xu, & Mossop, 1987), can be found in the literature. However, among these studies, negative correlations have also been reported (Bridgeman, Cho, & DiPietro, 2015; Manganello, 2011; Neal, 1998).
TOEFL/GPA studies are associated with a wide range of institutional contexts and often focus on relatively small groups (sample size typically below 200) of graduate or undergraduate students within particular disciplines (e.g., Al-Musawi & Al-Ansari, 1999; Johnson, 1988; Lo, 2002; Stoynoff, 1997; Zhang, 1996). In many studies, as in the initial analyses conducted at Purdue, students with different linguistic and academic backgrounds are combined. This practice is problematic as students with different L1 backgrounds may include students who have very different L2 language learning opportunities and proficiency profiles in terms of their subscale scores (e.g., Hindi vs. Chinese). In addition, in an early review of 18 studies examining the relationship between entry-level language proficiency and subsequent GPA, Graham (1987) argued that the wide variability characteristic of such studies should be expected given the different language demands across fields (e.g., science vs. liberal arts).
Furthermore, the range should not be surprising given that GPA for all students, whether international or domestic, is influenced by many factors, including not only English language proficiency, L1 background, and discipline-specific language requirements, but also study habits, content knowledge, quantitative skills, area of study, motivation, persistence, number of credit hours attempted, the need to work a part-time job and broader financial concerns, as well as integration into the larger academic and social communities of support. Graham (1987) maintains that a particular proficiency threshold, often referred to as a necessary but insufficient condition for academic success, may be what is critically important for incoming students; once the threshold is met, the importance of English language proficiency may be superseded by other factors. Given the complexity of the circumstances contributing to student success as measured by GPA, expectations about the correlation between entry-level proficiency and GPA must be tempered by acknowledgement of these additional influences.
Another critical qualification for research addressing the relationship between GPA and TOEFL scores is that it spans all three versions of the test: the paper-and-pencil test (PBT), the computer-based TOEFL (CBT), and the current version, the TOEFL iBT. The majority of the studies that examine TOEFL/GPA are based on earlier versions – the PBT or CBT – that did not include a speaking subsection or the current version of the writing subsection (Al-Musawi & Al-Ansari, 1999; Johnson, 1988; Lo, 2002; Stoynoff, 1997; Zhang, 1996). The inclusion of the productive-language speaking and writing subsections on the TOEFL iBT (launched in 2005) has resulted in a decidedly different representation of language proficiency than was present in prior versions of the test. However, only a very small number of studies have looked specifically at the predictive ability of the TOEFL iBT (e.g., Cho & Bridgeman, 2012; Bridgeman et al., 2015).
To date, only two published studies have been found that address the relationship between the TOEFL iBT and GPA, both conducted by researchers at ETS. Cho and Bridgeman (2012) reported results from correlational analyses examining TOEFL iBT scores and GPA across 10 institutions with a total sample of 2594 undergraduate and graduate students. Although samples from different institutions were combined, the total score/GPA correlations were consistently weak for both undergraduate (.18) and graduate students (.16). Analyses were then augmented by the use of contingency tables. These tables indicated that students with higher entry-level scores on a variety of admissions tests including the TOEFL iBT did indeed tend to have higher subsequent GPAs.
In contrast to the weak correlations reported by Cho and Bridgeman (2012), Bridgeman et al. (2015) found remarkably strong correlations for TOEFL iBT scores and first-year GPA for Chinese undergraduate students at an urban university in the States. They reported as follows: when all students [with different L1s] were pooled in a single analysis, the correlation of scores from the Test of English as a Foreign Language (TOEFL) with GPA was .18; in a subsample of engineering students from China, the correlation with GPA was .58, or .77 when corrected for range restriction. Similarly, the corrected correlation of the TOEFL Reading score with GPA for Chinese business students changed dramatically (from .01 to .36) when students with an extreme discrepancy between their receptive (reading/listening) and productive (speaking/writing) scores were trimmed from the sample.
Three aspects of the findings reported by Bridgeman et al. (2015) parallel those we will report in the following sections. Correlations changed when different subsamples were selected (all incoming undergraduate students vs. Chinese only); correlations increased when corrected for restriction of range; and, as will be seen, when a subgroup of Chinese students with a noticeably discrepant score profile (students with an extreme discrepancy between their receptive and productive scores) was isolated or removed from the analyses, correlational patterns changed.
In addition, Bridgeman et al. (2015) reviewed the methodological problems of TOEFL/GPA studies that may influence the results of correlational analyses. As a result of selection in the admissions process, the range of entry-level test scores is restricted, but subsequent correlations are often reported without correction for restriction of range. While restriction of range is an inevitable result of selection, it becomes problematic when researchers draw conclusions about the relationship between language proficiency and GPA without correction for restriction, as the correlation observed in a sample with restricted range tends to underestimate the relationship between the variables involved.
Despite the variety of results reported in the literature, in practice US institutions of higher education do recognize the importance of entry-level English language proficiency, as demonstrated by their “front-end” admission policies (proficiency requirements) and “back-end” instructional practices (accommodations). An institution’s relative prioritization of these two approaches can largely be predicted by its selectivity ranking in one of three broad tiers.
For example, Carnegie Mellon (2015), a highly selective North American institution renowned for its STEM programs, provides the following admission information for prospective international applicants: The Test of English as a Foreign Language (TOEFL) or the International English Language Testing System (IELTS) is required if your native language is not English. Carnegie Mellon requires TOEFL scores of 102 or better on the internet-based TOEFL (as of Fall 2010) or an IELTS score of 7.5 and above. Carnegie Mellon carefully reviews the sub-scores of each of these exams and considers those candidates with reading, listening, speaking and writing sub-scores of 25 or more on TOEFL and 7.5 or more on IELTS to be candidates with high levels of English proficiency.
Like most highly selective institutions in North America, Carnegie Mellon does not require international students to take post-entry language proficiency testing or enroll in English for academic purposes (EAP) classes; in their view, restrictive “front-end” selectivity conditions obviate the need for “back-end” accommodations.
As entry-level language proficiency requirements fall, requirements for post-entry language proficiency testing and enrollment in EAP language support programs tend to increase. For example, the University of Illinois at Urbana-Champaign (UIUC) has set its general entry-level TOEFL total score requirement at 79 (as compared to Carnegie Mellon’s 102) and requires post-entry English language proficiency testing of admitted students with entry-level TOEFL iBT scores of between 79 and 102, but it also requires international students to enroll in EAP courses if they do not meet a certain score threshold on either test. Fully developed EAP programs such as UIUC’s (and Purdue’s) are typically found at large, public institutions with strong academic reputations and second tier selectivity.
International students whose TOEFL scores do not meet the typical minimal total cut score for admission have the option of enrolling in intensive language programs. Intensive English Institutes (IEIs) are pre-admission programs that typically admit students with TOEFL total scores below 80, and they require a minimum of 20 hours of classroom instruction in English per week. A perceived advantage for these students is that they are often subsequently admitted on the basis of successful program completion.
Institutional acceptance of the importance of entry-level English language skills is reflected in several admissions practices: Students with TOEFL iBT total scores below 80 are not typically considered for admission but have the option to enroll in an IEI to improve their language skills; students with scores from 80 to 100 are re-tested locally and then placed in EAPs; and those testing above 100 are exempt from post-entry testing and/or mandatory EAP courses. These practices also suggest that the threshold proficiency levels recommended by Graham (1987) have been established in practice, although it is unclear how or why the use of these particular cut scores has developed into standard practice.
Selection criteria restrict the possible score range of admitted international students at a given university, but variation in score profiles also results from the common admissions practice – found especially in universities at the second tier of selectivity – that only considers total scores. Though widely varying score profiles meet the same total score requirement, they may have different implications with respect to subsequent academic success and the provision of English language support. For example, with a total score of 100 on the TOEFL iBT, a student who has a score of 25 on all subskills is decidedly different in his or her abilities in English from a student who has a score of 30 on listening and reading subsections but only a score of 20 on speaking and writing subsections. Therefore, even within a particular range of English language proficiency test scores, there can be considerable variability of English language skills among international undergraduate students, as reflected in their score profiles, and these profiles may have an impact on students’ academic success. Despite the potential importance of score profiles, researchers have not investigated how score profiles predict academic success.
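The contrast between the two hypothetical students above can be made concrete with a small sketch. The function name `profile_gap` and the definition of the gap (mean receptive score minus mean productive score) are illustrative assumptions, not a measure used in this study; the two score sets are the ones given in the example.

```python
def profile_gap(reading, listening, speaking, writing):
    """Mean receptive score minus mean productive score (illustrative index only)."""
    receptive = (reading + listening) / 2
    productive = (speaking + writing) / 2
    return receptive - productive


# The two example students from the text: same total, different profiles.
balanced = {"reading": 25, "listening": 25, "speaking": 25, "writing": 25}
uneven = {"reading": 30, "listening": 30, "speaking": 20, "writing": 20}

for label, scores in (("balanced", balanced), ("uneven", uneven)):
    total = sum(scores.values())
    print(label, "total:", total, "gap:", profile_gap(**scores))
```

Both students would pass a total-score screen of 100, but only a subscale-level view distinguishes them.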
The remainder of this article will report the results of our analyses of three consecutive incoming cohorts of international undergraduate Chinese students at Purdue University, in which we examined the relationship between language proficiency (TOEFL iBT total and subscale scores) and first-year GPA. Specifically, our (re-)analyses set out to address the following research questions:
What are the relationships between TOEFL iBT total and subsection scores and first-year GPA for Chinese students, Purdue’s largest international undergraduate subgroup of incoming students?
Are there different TOEFL iBT score profiles present among Chinese undergraduate students?
Is there an impact of particular TOEFL iBT score profiles on Chinese students’ first-year GPA?
Method
Data
The data used in this study included academic records of 1990 undergraduate international students enrolled at Purdue University from the 2011–2012 academic year to 2013–2014 (N2011 = 740, N2012 = 554, N2013 = 696). Specifically, for international undergraduates who submitted TOEFL iBT scores for admission to the university, we requested the following information:
Demographic information: native country, college, and department;
First-year GPA as broken down by: first-semester, second-semester and first-year GPA;
TOEFL iBT scores: total and subsection scores, test date.
The most recent TOEFL score report was used for students who submitted multiple test score reports.
Analyses
The relationship between TOEFL and GPA was examined through two steps of analysis: (1) correlations of TOEFL iBT total and subsection scores with first-year GPA (including first- and second-semester GPAs); and (2) cluster analyses of TOEFL iBT score profiles and their relationships with first-year GPA. Analyses were conducted separately for each academic year to control for possible differences of grading standards and test scores across the years. Both correlational analyses and cluster analyses of the data were performed using IBM SPSS, Version 21 (IBM Corp, 2012).
Correlational analyses
Pearson r correlations were computed to estimate the relationship between TOEFL iBT scores and first-year GPA (Research Question 1), including both observed correlations (r) and correlations corrected for range restriction (adjusted r). As mentioned above, a common problem in predictive validity studies in educational and psychological research is that both variables tend to be restricted in range due to admissions or employment selection (also referred to as direct range restriction; Wiberg & Sundstrom, 2009). For example, in investigations of the predictive validity of higher education admission tests (e.g., TOEFL iBT), the test scores tend to represent a restricted sample of the entire TOEFL iBT test taker population as only students admitted to the program or school are included in the analyses. Therefore, the observed correlations tend to underestimate the relationship between the variables involved. The adjusted or corrected correlations, on the other hand, represent a reasonable expectation of the theoretical relationship existing between the two variables in the population.
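In this study, observed correlations were adjusted using the Thorndike Case II formula (Thorndike, 1949), which can be expressed in the following mathematical equation:
\[
r_c = \frac{r \,(S_x / s_x)}{\sqrt{1 - r^2 + r^2 \,(S_x / s_x)^2}}
\]
where, in the standard notation for this correction, $r_c$ is the correlation corrected for range restriction, $r$ is the observed correlation in the restricted (admitted) sample, $s_x$ is the standard deviation of TOEFL iBT scores in the restricted sample, and $S_x$ is the standard deviation of TOEFL iBT scores in the unrestricted population.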
This formula is considered appropriate for this study because the Thorndike Case II correction formula is suitable for direct range restriction – that is, when test score range restriction is caused directly by the selection of applicants (for a detailed discussion on the selection of appropriate correlation correction method, see, e.g., Sackett & Yang, 2000; Wiberg & Sundstrom, 2009). Because the Thorndike Case II formula requires unrestricted population parameters, the mean and standard deviation of TOEFL iBT scores for undergraduate-level students, as reported in the TOEFL iBT test and score data summary (ETS, 2010), were used as proxies of the unrestricted population parameters.
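As a minimal sketch of how such a correction can be computed, the function below implements the standard form of the Thorndike Case II adjustment. The function and parameter names are our own, and the numbers in the usage example are hypothetical values chosen for illustration, not results from this study or from ETS.

```python
import math


def thorndike_case2(r, sd_restricted, sd_unrestricted):
    """Correct an observed correlation for direct range restriction.

    r               -- observed Pearson r in the restricted (admitted) sample
    sd_restricted   -- SD of the predictor in the restricted sample
    sd_unrestricted -- SD of the predictor in the unrestricted population
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if sd_restricted <= 0 or sd_unrestricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted  # ratio of unrestricted to restricted SD
    return (r * u) / math.sqrt(1.0 - r ** 2 + (r ** 2) * (u ** 2))


# Hypothetical illustration: an observed r of .30 in a sample whose SD is
# half that of the population.
print(round(thorndike_case2(0.30, 10.0, 20.0), 3))
```

When the two standard deviations are equal, the function returns the observed correlation unchanged; the sign of the correlation is preserved, so negative observed correlations become more strongly negative after correction.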
Cluster analyses
In order to explore the possible TOEFL iBT score profiles within the three international undergraduate cohorts (Research Question 2), both a hierarchical cluster analysis and a subsequent K-Means cluster analysis of TOEFL iBT subsection scores were performed. This two-stage approach of cluster analysis can help identify the number of score profiles and the average subsection scores of each score profile, respectively. Performing hierarchical and K-Means cluster analyses successively is recommended because the two clustering algorithms can overcome the shortcomings of each other when they are used in conjunction (Mooi & Sarstedt, 2011; Punj & Stewart, 1983).
Specifically, hierarchical cluster analysis, using Ward’s method of minimum within-group variance, was performed to identify the number of score profiles (i.e., clusters of TOEFL iBT subsection scores) existing among Chinese students enrolled each academic year. As hierarchical cluster analysis is largely exploratory, the number of clusters is often determined visually through a scree plot, which illustrates the change in agglomeration coefficient as the number of potential clusters increases. The optimal number of clusters is determined by locating a clear demarcation point or a big drop in the change in agglomeration coefficient in the scree plot (Burns & Burns, 2009). The optimal number of clusters was then specified in a subsequent K-Means cluster analysis to extract the average subsection scores of each score profile (i.e., the centroids of the observations in each cluster) and the number of observations in each score profile. In addition, to address Research Question 3, boxplots of first-year GPA by score profile were generated to examine the possible impact of TOEFL score profile on first-year GPA.
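The two-stage procedure can be sketched in plain Python as follows. This is an illustration, not the SPSS procedure used in the study. Three choices here are our own assumptions: the largest jump in merge cost stands in for visual inspection of the scree plot, the K-Means step is seeded with the centroids of the Ward solution, and the six (reading, listening, speaking, writing) profiles are invented toy data. The pairwise search is cubic in the number of students, so a statistics package would be preferable for real cohorts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Mean score vector of a non-empty list of profiles."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def ward_history(points):
    """Agglomerate with Ward's minimum-variance criterion.

    Returns (costs, partitions): costs[m] is the increase in within-cluster
    sum of squares from the merge that left m clusters; partitions[m] is
    the m-cluster solution.
    """
    clusters = [[p] for p in points]
    costs, partitions = {}, {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ni, nj = len(clusters[i]), len(clusters[j])
                cost = ni * nj / (ni + nj) * sq_dist(
                    centroid(clusters[i]), centroid(clusters[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
        costs[len(clusters)] = cost
        partitions[len(clusters)] = [list(c) for c in clusters]
    return costs, partitions


def choose_k(costs):
    """Heuristic elbow: the k at which one further merge adds the most cost.

    Requires at least three profiles.
    """
    candidates = [k for k in costs if k >= 2]
    return max(candidates, key=lambda k: costs[k - 1] - costs[k])


def k_means(points, seeds, max_iter=100):
    """Plain K-Means starting from the given seed centroids."""
    centers = list(seeds)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = [centroid(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Invented toy profiles: three balanced, three receptive-heavy.
profiles = [(25, 25, 24, 24), (26, 25, 23, 25), (24, 26, 25, 24),
            (29, 29, 17, 18), (30, 28, 16, 19), (28, 30, 18, 17)]
costs, partitions = ward_history(profiles)
k = choose_k(costs)
centers, groups = k_means(profiles, [centroid(c) for c in partitions[k]])
for center, group in zip(centers, groups):
    print(len(group), [round(v, 1) for v in center])
```

The first stage supplies the number of profiles; the second returns the centroid (average subsection scores) and size of each profile, which is the information reported for each cohort.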
Results and discussion
The results of the study showed that the TOEFL total score alone does not predict first-year GPA. However, when subscale scores and score profiles were examined, different subskills demonstrated differential relationships with first-year GPA. More interestingly, first-year GPA also differed across score profiles. These findings suggest that the focus of admissions practice should be shifted from total score to both subscale scores and score profiles.
Descriptive statistics of TOEFL iBT scores and first-year GPA
Table 1 presents the summary statistics for TOEFL iBT (total and subsection) scores and first-year GPA of Chinese students by academic year. The average TOEFL total scores varied little across the three years, with means clustering around 90. The mean subsection scores remained similar as well, with higher reading and listening scores and lower writing and speaking scores; these averages represent the typical Chinese student's score, or language skills, profile. Similarly, the first-year GPAs (as well as first- and second-semester GPAs) of Chinese students centered around 3.0 across the three cohorts: the averages for students enrolled in 2011 and 2013 were around 3.0, whereas the average for 2012 was slightly below 3.0. The differences in average GPA invite separate examinations of correlations between TOEFL scores and GPA for each academic year.
Table 1. Descriptive statistics for TOEFL iBT scores of Chinese students by year.
In spite of comparable average TOEFL iBT scores across the three academic years, the ranges of scores on the TOEFL writing and speaking subsections across the three academic years displayed noticeable differences (see Appendix A). The changes in score ranges on productive skills were largely a result of the new admissions policy effective in 2013, which set cut scores of 18 on both TOEFL iBT writing and speaking subsections. Consequently, there were few enrolled students in the academic year 2013–2014 who had TOEFL writing and speaking scores below 18.
It should be noted that the change in admissions policy occurred after the authors presented the re-analysis of the two academic cohorts at an English as a second language (ESL) summit, a campus-level meeting about language support and needs for international students that was attended by administrators from the Office of the Provost and major colleges and departments at Purdue. This policy change aligns with concerns widely expressed by faculty about the English proficiency level of incoming international students. Setting the minimum score at 18 was nevertheless an arbitrary decision; fortunately, as will be seen in the following section, it changed the relationship between TOEFL scores and first-year GPA for Chinese students for the better.
Correlations between TOEFL iBT scores and first-year GPA
Tables 2, 3, and 4 present both the observed and adjusted correlation coefficients of TOEFL iBT scores with first-semester, second-semester, and first-year GPAs for academic years 2011–2012, 2012–2013, and 2013–2014, respectively. Overall, the correlational patterns were similar for the two semesters and across the first year; therefore, we chose to report and discuss only the correlation coefficients for first-year GPAs in the subsequent sections.
Table 2. Observed and adjusted correlations between TOEFL scores and GPA – 2011 (N = 740).
Table 3. Observed and adjusted correlations between TOEFL scores and GPA – 2012 (N = 554).
*p < .05, **p < .01, ***p < .001.
The Chinese students from the 2011 cohort demonstrated interesting patterns of correlations between TOEFL iBT scores and GPA. As shown in Table 2 (the sixth column), the correlation between TOEFL iBT total score and overall first-year GPA was close to zero (roverall = .07). However, when the total was broken down into subsection scores, significant positive correlations were observed between speaking and writing subsection scores and first-year GPA (rwriting = .27***, rspeaking = .15***). After correction for range restriction, the correlation coefficients increased (rA_writing = .41, rA_speaking = .28). More interestingly, the correlations between TOEFL reading and listening scores and first-year GPA were negative (rreading = −.07, rlistening = −.13***). After correction for range restriction, the negative correlations grew in magnitude (rA_reading = −.13, rA_listening = −.22).
Similar correlations were observed in the 2012 cohort (see the sixth column of Table 3). Overall, the correlation coefficient between TOEFL iBT total score and first-year GPA was close to zero (roverall = .04). The observed correlation coefficients between speaking and writing subsection scores and first-year GPA were positive and stronger than the correlations for the 2011 cohort (rwriting = .32***, rspeaking = .25***). Again, after correction for range restriction, the correlations increased (rA_writing = .47, rA_speaking = .44). The adjusted correlations indicate that, in an unrestricted or full range of language proficiency, writing or speaking skill alone can explain approximately 20% of the variance in first-year GPA (.47² ≈ .22, .44² ≈ .19). In contrast, the observed correlations between TOEFL reading and listening scores and first-year GPA remained significantly negative (rreading = −.19***, rlistening = −.21***). After correction for range restriction, the negative correlations grew in magnitude (rA_reading = −.38, rA_listening = −.38). The reversed correlational patterns for receptive and productive skills were corroborated by the negative correlations between reading/listening scores and speaking/writing scores on the TOEFL iBT for the 2011 and 2012 cohorts, suggesting that first-year GPA is negatively related to listening and reading scores on the TOEFL iBT but positively related to speaking and writing scores (see Appendix B for the correlation matrix for TOEFL iBT subscale scores by academic year, with the negative correlations in bold).
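The link between an adjusted correlation and the share of variance it explains is simply the squared coefficient. The correction itself can be sketched as follows, on the assumption that a Thorndike Case 2 correction for direct range restriction is meant; this part of the article does not state the formula used, and the standard deviations passed in below are hypothetical values chosen only to show the direction of the adjustment.

```python
import math

def correct_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction (assumed here, not confirmed by the text).

    r               -- correlation observed in the restricted (admitted) group
    sd_unrestricted -- predictor SD in the full applicant population
    sd_restricted   -- predictor SD in the admitted group
    """
    u = sd_unrestricted / sd_restricted
    return r * u / math.sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))

def variance_explained(r):
    """Proportion of criterion variance accounted for by a correlation r."""
    return r ** 2

# Hypothetical SDs: the admitted group is assumed to be less variable.
adjusted = correct_range_restriction(0.32, sd_unrestricted=6.0, sd_restricted=4.0)
print(round(adjusted, 2), round(variance_explained(adjusted), 2))
```

When the two standard deviations are equal the function returns the observed correlation unchanged, and the sign of the coefficient is always preserved, which is why the negative reading and listening correlations become more negative after adjustment.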
However, the 2013 cohort demonstrated a different correlational pattern between TOEFL scores and GPA. As shown in Table 4 (the sixth column), the correlation between TOEFL iBT total score and overall first-year GPA was significant and positive (roverall = .13**). After correction for range restriction, the correlation coefficient increased (rA_overall = .38). This result appears in line with general expectations, as it indicates that general language proficiency, as reflected in TOEFL iBT total scores, can account for around 15% of the variance in first-year GPA. When the total was broken down into subsection scores, writing subscale scores showed the strongest correlation with first-year GPA (rwriting = .15***). Notably, the negative correlations between TOEFL reading and listening scores and first-year GPA observed in the two earlier cohorts disappeared (rreading = .09*, rlistening = .05).
Table 4. Observed and adjusted correlations between TOEFL scores and GPA – 2013 (N = 696).
*p < .05, **p < .01.
In summary, for the 2011 and 2012 Chinese student cohorts, moderately strong and positive correlations were observed between speaking and writing subscale scores and GPA, and moderately strong but negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, the correlation coefficients dropped substantially, writing subscale scores showed the strongest correlation with GPA, and the negative correlations disappeared. Overall, writing and speaking scores tended to predict first-year GPA consistently across academic years, whereas reading and listening scores demonstrated counter-intuitive correlational patterns with first-year GPA for this sample.
Results of the correlation analyses for the 2011 and 2012 cohorts seem to suggest that students who had stronger reading and listening skills performed worse than students with poorer reading and listening skills. These results are counter-intuitive, as one would expect these language skills to contribute to rather than inhibit academic success. In addition, although reading, listening, writing, and speaking are often regarded as separate language skills, these skills have been demonstrated to be subsumed under a higher-order construct, that is, general language proficiency (e.g., Sawaki, Stricker, & Oranje, 2009). The opposite correlational patterns for TOEFL reading and listening scores (reflecting receptive skills) and TOEFL writing and speaking scores (reflecting productive skills) indicate that summing these subscale scores may be problematic, especially for the selection of applicants for university admission purposes.
The difference in the correlational patterns between TOEFL scores and GPA in the 2013 cohort prompted further investigation of the relationship among subsection scores or score profiles. Note that, in 2013, the new admission policy regarding TOEFL subsection scores had eliminated students with very low subscale scores, especially on writing and speaking. The reduced range is reflected in the smaller standard deviations for reading, listening, and writing scores (see Table 1, last column). In addition, the substantial drop in the writing correlations with GPA from 2012 to 2013 is worthy of note. Although the standard deviation dropped only modestly (a difference of 0.51), removing essentially all the students with very low writing scores impacted the GPA correlation as well. More interestingly, changes in writing/speaking scores appeared to have had an impact on the correlational patterns for reading and listening scores, suggesting that in addition to individual subskills, the combination of subscale scores (i.e., score profile) might also be at play. Thus, the examination of the relationship between TOEFL and GPA should take subsection scores and score profiles into consideration. If selection committees were to use cut scores on subskills, rather than cut scores based on the total score, and pay attention to score profiles, then the performance of the cohort as reflected by first-year GPA might improve.
Emergent TOEFL iBT score profiles
To address Research Question 2, Figure 1 presents the scree plot for the hierarchical cluster analyses of TOEFL iBT subsection scores of Chinese students, with each line representing a separate analysis for each academic year. All three lines point to the same demarcation point at three clusters, suggesting that there were three TOEFL iBT score profiles among Chinese students in each academic year.

Figure 1. Scree plot for hierarchical cluster analysis of TOEFL iBT subscale scores.
When the optimal number of clusters was entered in the subsequent K-Means cluster analysis, the cluster centroids (i.e., the center of each cluster or score profile group) were extracted, as shown in Table 5. Altogether, four score profiles emerged, but only three appeared in any one academic year; that is, the set of score profiles among Chinese students differed across cohorts. The four score profiles are labeled as follows and are presented graphically in Figure 2:
Balanced high score profile: a score of 25 or above on all four subsections;
RLW > S: higher reading, listening and writing scores but a lower speaking score; this profile features reading and listening scores of 25 or above, writing scores of 22 or above, but a speaking score of around 20;
Balanced low score profile: a score of around 21 on all four subsections;
Discrepant score profile: very high reading and listening scores but very low writing and speaking scores; this profile features a score of 28 or above on reading and listening, but a score of 18 or below on writing and speaking.
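To make the four labels concrete, the sketch below assigns a student to the profile whose centroid is nearest in squared Euclidean distance, which is the assignment rule K-Means applies. The centroid values are illustrative numbers chosen to match the verbal descriptions above; they are assumptions, not the centroids reported in Table 5, and the study fitted three clusters per cohort rather than one four-cluster model.

```python
# Order of scores in every tuple: (reading, listening, speaking, writing).
# Illustrative centroids only; NOT the values reported in Table 5.
ILLUSTRATIVE_CENTROIDS = {
    "balanced high": (26, 26, 25, 26),
    "RLW > S": (26, 26, 20, 23),
    "balanced low": (21, 21, 21, 21),
    "discrepant": (29, 29, 17, 17),
}

def sq_dist(a, b):
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_profile(scores, centroids=ILLUSTRATIVE_CENTROIDS):
    """Return the label of the centroid closest to `scores`."""
    return min(centroids, key=lambda name: sq_dist(scores, centroids[name]))

# Hypothetical students.
print(nearest_profile((30, 29, 16, 17)))  # very high R/L, very low S/W
print(nearest_profile((21, 22, 20, 21)))  # uniformly modest scores
```

Note that two students with the same total score can fall into different profiles, which is the reason the total alone cannot distinguish them.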
Table 5. TOEFL iBT subscale score profiles among Chinese students (centroids of clusters).

Figure 2. The four TOEFL iBT score profiles.
Among the four score profiles, two profiles, that is, RLW > S and balanced low, appeared consistently across the three cohorts. As shown in Table 5 (the third column), both score profiles were common among Chinese students, with the balanced low score profile (n2011 = 313, n2012 = 203, n2013 = 283) more common than the RLW > S score profile (n2011 = 240, n2012 = 205, n2013 = 228) in 2011 and 2013 and the two nearly equal in size in 2012. It is interesting to note that the discrepant score profile appeared in both the 2011 and 2012 cohorts (n2011 = 187, n2012 = 146) but disappeared in the 2013 cohort, where it was replaced by the balanced high score profile (n = 185). The absence of the discrepant score profile in 2013 is to be expected given the new admission policy on cut scores for the TOEFL iBT writing and speaking subsections. Recall that a minimum score of 18 for each of the two sections was added to the selection criteria for admission in 2013. Consequently, a large number of students with the discrepant score profile were excluded. This change of score profile co-occurred with the change in correlational patterns between TOEFL total, subscale scores, and first-year GPA, suggesting the possibility of a score profile effect on first-year GPA.
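The effect of the subscore minimums on the discrepant profile can be illustrated with a small screening sketch. The three applicants are invented, and the function names are hypothetical; the thresholds are the ones mentioned in the article (a minimum of 18 on writing and speaking, and the total-score requirement of 80 referred to in the concluding section).

```python
def meets_total_cut(scores, min_total=80):
    """Total-score rule: sum of the four subsection scores."""
    total = (scores["reading"] + scores["listening"]
             + scores["speaking"] + scores["writing"])
    return total >= min_total

def meets_subscore_minimums(scores, min_writing=18, min_speaking=18):
    """Subscore rule: minimums on the two productive skills."""
    return scores["writing"] >= min_writing and scores["speaking"] >= min_speaking

# Hypothetical applicants; A resembles the discrepant profile.
applicants = [
    {"id": "A", "reading": 29, "listening": 28, "speaking": 16, "writing": 17},
    {"id": "B", "reading": 26, "listening": 26, "speaking": 20, "writing": 23},
    {"id": "C", "reading": 21, "listening": 21, "speaking": 21, "writing": 21},
]

passes_total_only = [a["id"] for a in applicants if meets_total_cut(a)]
passes_both = [a["id"] for a in applicants
               if meets_total_cut(a) and meets_subscore_minimums(a)]
print(passes_total_only, passes_both)
```

Applicant A clears the total-score rule on the strength of reading and listening alone, and is screened out only once the writing and speaking minimums are applied.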
The relationship between TOEFL iBT score profile and first-year GPA
Table 6 further presents summary descriptive statistics of TOEFL iBT total and subsection scores and first-year GPA by score profile and academic year. As shown in the table (the last column), the average first-year GPAs for students with the RLW > S, balanced high and balanced low score profiles tended to cluster around 3.0, with the RLW > S and balanced high groups slightly higher than the balanced low group. This result suggests that students with lower language abilities tend to perform less well academically than students with higher language skills. However, students with the discrepant score profile performed the least well in terms of first-year GPA, with an average first-year GPA well below 3.0 (M2011 = 2.66, SD2011 = 1.01; M2012 = 2.24, SD2012 = 1.11). Moreover, boxplots of first-year GPA by score profile in Figures 3 and 4 show that students with a discrepant score profile had the widest range of first-year GPAs when compared to the other score profile groups. These results suggest that score profile has an impact on first-year GPA in that the discrepant score profile group performed least well on academic courses when compared to other score profile groups.
Table 6. Means and standard deviations of TOEFL iBT scores and first-year GPA by score profile and year.

Figure 3. Boxplots of first-year GPA by TOEFL iBT score profile, 2011–2012.

Figure 4. Boxplots of first-year GPA by TOEFL iBT score profile, 2013.
Additionally, results of the cluster analyses help explain the negative correlational patterns between TOEFL reading and listening scores and first-year GPA. As is clearly illustrated in Figure 3, the discrepant score profile group had the lowest first-year GPA. Recall that the discrepant score profile features the highest average reading and listening scores (≥ 28) and the lowest writing and speaking scores (≤ 18) among all score profile groups. Therefore, the negative correlations between reading/listening and speaking/writing scores were, most likely, the result of the discrepant score profile subgroup. The number of students with this score profile was noticeably reduced after the new admission policy was implemented in 2013, and subsequently, the correlational patterns aligned with general expectations; that is, language proficiency had a moderate but positive relationship with academic success as measured by first-year GPA.
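The mechanism argued for here, in which a single subgroup with high reading scores and low GPAs reverses the sign of the pooled correlation, can be reproduced with a small constructed example. All values below are invented for illustration and are not data from the study; `pearson_r` is a plain implementation of the Pearson coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (reading score, first-year GPA, profile); toy values, NOT study data.
students = [
    (20, 2.9, "balanced low"), (21, 3.0, "balanced low"),
    (22, 3.1, "balanced low"), (23, 3.2, "balanced low"),
    (25, 3.0, "RLW > S"), (26, 3.1, "RLW > S"),
    (27, 3.2, "RLW > S"), (28, 3.3, "RLW > S"),
    (28, 2.2, "discrepant"), (29, 2.4, "discrepant"),
    (30, 2.6, "discrepant"), (30, 2.3, "discrepant"),
]

def reading_gpa_r(rows):
    """Correlation between reading score and GPA for the given rows."""
    return pearson_r([r for r, _, _ in rows], [g for _, g, _ in rows])

r_pooled = reading_gpa_r(students)
r_without = reading_gpa_r([s for s in students if s[2] != "discrepant"])
print(round(r_pooled, 2), round(r_without, 2))
```

In this toy sample reading and GPA rise together within the two balanced groups, yet the pooled coefficient is negative because the highest reading scores belong to the group with the lowest GPAs; dropping that group restores a positive coefficient.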
Summary and conclusions
Inconsistent results characterize the research literature examining the relationship between entry-level language proficiency, as measured by the TOEFL iBT, and academic success, typically measured by first-year GPA. Such inconsistency is likely because the relationship between language proficiency and academic success may differ across student groups of different L1 backgrounds, and, as is shown in this study, different English language proficiency profiles within single student groups.
A recent study conducted by the National Association for College Admission Counseling (NACAC) (2016) reports that of the 400 institutions surveyed, slightly more than 50% conducted institutional validity studies on the use of the SAT and ACT for domestic admissions, and 59% of these institutions did so annually. The association argues that the need for local validity studies is growing due to changes in these tests and student demographics: “Recent changes in the content of the SAT, increased use of the SAT and ACT as high-school assessment instruments, and the changing demographics of students who take the tests could all affect the predictive validity of test scores” (p. 4). In order to ensure appropriate score use and interpretation of language proficiency tests, language testers and admissions officers responsible for international student admissions need to do the same. As Fischer (2011) comments, “Getting it right matters, as colleges step up overseas admissions to diversify their student bodies, internationalize their campuses, and buttress their bottom lines.”
Our study began with a re-analysis of the data examined in an internal analysis of TOEFL iBT/GPA conducted by Enrollment Management at Purdue. However, these analyses were extended to include not only the relationship between TOEFL iBT total scores and GPA but also the relationship between TOEFL iBT subscale scores and GPA. We disaggregated the sample and examined only a single L1 group, albeit the largest: first-year Chinese students. In addition, we corrected the correlation coefficients for range restriction and thereby avoided shortcomings common to studies that examine the relationship between entry-level language proficiency and academic success.
As our initial results were so unexpected, we extended the study beyond our initial analysis of a single cohort and included two additional years of student data. Cluster analyses were included along with correlational analyses to determine whether distinctive TOEFL score profiles might underlie and help explain the unexpected patterns of correlations that were observed in the 2011 and 2012 cohorts. The identification of the unexpected correlations among these subgroups led to a revision of university admissions policy and practice, demonstrating that, on occasions when policy and research interests align, collaboration between admissions officers and language testing researchers has the potential to produce very real short- and long-term effects. At Purdue, the TOEFL iBT cut score requirement of 80, which remains the standard for admission at many U.S. institutions, was revised by adding minimum subscale requirements of at least 18; and this, apparently slight, shift in practice can be argued to have produced the substantive change in results that we observed in the third year of our study. The additional cluster analyses that augment the correlational analyses reveal that even within a single L1 subgroup, different score profiles may be present. Happily for admissions officers at Purdue University, the change in policy also seems to have resulted in the emergence of a new subgroup of Chinese students – those with high subscale scores across all four sections of the TOEFL iBT. The presence of this examinee profile may allow some of Purdue’s international students to compare favorably with those admitted to Purdue’s aspirational peers, and it will be interesting to see whether there will be any improvement in four-year GPA and graduation rates when the class graduates in 2017.
The presence of the discrepant score profile subgroup in Purdue’s 2011 and 2012 Chinese student cohorts raises some interesting questions. One wonders whether this profile is something that results from a particular style of TOEFL test preparation. It might be the case that extreme or excessive test preparation may lead to an unexpected advantage on the multiple-choice sections of the TOEFL (reading and listening) to the detriment of examinees’ development of productive-language abilities (speaking and writing). In addition, the possibility that discrepant score profile examinees may be cheating on the exam cannot be ruled out; however, empirical evidence is needed in order to better explain this score profile.
In the 2015 study conducted by Bridgeman et al., examinees who displayed the discrepant score profile were “trimmed” from the sample in order to present what was argued to be a more accurate representation of the relationship between entry-level language proficiency scores and first-year GPA. In the current study, examinees with the discrepant score profile were deliberately excluded from admission by a change in institutional admissions policies and procedures. However, it should be noted that the number of students who displayed this profile in 2011 and 2012 was fairly large (n2011 = 187 and n2012 = 146), and it may not be possible at many universities to “trim” these applicants in actual admissions practice. Indeed, in many cases, such trimming may not be desirable for institutions that value hitting enrollment targets more than the composition of English language skills displayed by the incoming class. However, Chinese students who display the discrepant score profile may need different kinds of support after arriving on campus due to difficulty in performing at the same levels as their peers who display the more expected balanced low or RLW > S language proficiency profiles.
In conclusion, the current study demonstrates, at the very least, that informed selection procedures and policies should include consideration of language proficiency subscale scores and score profiles, and that institutional selection policies can be enhanced when enrollment and language testing professionals collaborate. It is hard to identify at-risk students with the discrepant score profile when selection procedures include only the consideration of examinees’ TOEFL iBT total scores. Interestingly, in this study, the direct range restriction and admissions policy allowed for explorations of how the relationship between language proficiency and GPA is moderated by score profile. The changes in correlational patterns in 2013 after the discrepant score profile was removed demonstrate that score profiles can moderate the relationships between language proficiency and academic success. These changes can also help researchers and policy makers recognize the importance of a certain threshold entry-level English language proficiency (e.g., speaking and writing skills) in academic success of international students.
Appendix
Appendix B. Correlations among TOEFL iBT scores by academic year.
| Year | Subscale | Reading | Listening | Speaking | Writing | Total |
|---|---|---|---|---|---|---|
| 2011 | Reading | – | .55** | | | .59** |
| | Listening | | – | | | .65** |
| | Speaking | | | – | .52** | .26** |
| | Writing | | | | – | .31** |
| 2012 | Reading | – | .52** | | | .47** |
| | Listening | | – | | | .59** |
| | Speaking | | | – | .57** | .29** |
| | Writing | | | | – | .35** |
| 2013 | Reading | – | .41** | .11** | .16** | .63** |
| | Listening | | – | .31** | .23** | .74** |
| | Speaking | | | – | .46** | .67** |
| | Writing | | | | – | .66** |
**p < .01.
Acknowledgements
We would like to thank the Provost’s Office and the Office of Enrollment Management at Purdue University for their invitation to examine the data included in this study. We are particularly grateful to Joe Potts, the Associate Dean of International Programs when the study began, and to Mike Brzezinski, the Dean of the Office of International Students and Scholars, for his interest and assistance from start to finish. We greatly appreciate Purdue program administrators’ willingness to allow careful examination of the use and interpretation of language proficiency test scores for international undergraduate admissions purposes.
We thank the three anonymous reviewers and our editor, Carol Chapelle, for the comments, queries, and assistance provided to revise and improve our original submission. Any errors that remain are our own.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
