Abstract
This study used survival analysis to examine the patterns and factors associated with time to achieving designated score criteria on a test of English as a foreign language. This was modeled using an extension of the Cox regression model, with two criterion score levels defined as achieving a TOEFL iBT® total test scale score at or above the Common European Framework of Reference (CEFR) Level B2 and at Level C1, respectively. Factors included in the model were test taker background characteristics including age, gender, native language type, exposure to English, and reason for testing. Additionally, to account for those who tested more than once within the study period, and thus had multiple records, an indicator for order of testing occasion was included in the model.
Results indicate that approximately 82% of the test takers in our study sample tested one time in the study period (2014–2016), and the number of repeaters decreased rapidly across occasions. For those who did not achieve the designated criterion scores at first testing, the likelihood of achievement increases with repeated testing, with a somewhat greater effect for the less stringent B2 criterion. Results also indicate that the association of gender with performance differed across levels.
With increasing globalization of market economies, multi-nationalization of business and industry, international mobility of the workforce, and the rise of English as a lingua franca, greater heterogeneity and diversity of the test taker population for English as a foreign language (EFL) can be expected. For example, candidates may need to demonstrate proficiency in English for reasons that include immigration, employment, admission to graduate study, and professional certification. Moreover, as testing companies provide greater flexibility for retesting (Barkaoui, 2017), there may be greater variation in test-taking patterns depending on test use and associated stakes. As a result, there is a need to identify the factors that affect time to achieving a designated level of proficiency, in order to support test users and the educational institutions that provide instructional services to EFL learners.
Research has shown that performance on large scale assessments may be influenced by a number of factors related to the individual, the assessment context, and the interaction thereof. In the field of language testing, the influence of test taker background characteristics on language ability has been studied extensively. Factors found to be associated with language proficiency and performance on language assessments include cultural background, native language, cognitive ability, gender, age, and the degree and type of exposure to the language being assessed (Bachman, 1990; Gradman & Hanania, 1991; Kunnan, 1998).
Another factor in test taker performance may be repeater status, owing to additional learning of tested content and other factors associated with retest decisions. As noted by Haberman and Yao (2015), test takers “may be more likely to repeat tests if their scores are lower than desirable for graduate admissions or if they have more financial resources” (p. 224).
To further explore the association of test taker characteristics with test performance over time, this study used survival analysis, also known as event history analysis, to examine the patterns and factors associated with time to achieving designated score criteria on a test of English as a foreign language, the TOEFL iBT®. Widely used in medical research, survival analysis is a class of statistical methods for studying the occurrence and timing of events in longitudinal data where the event of interest represents the transition between two mutually exclusive states (e.g., dead or alive, pass or fail, meeting a designated criterion score or not). In this context, survival is defined as not experiencing the event of interest. For each subject, there is the associated risk or likelihood of experiencing the event (not surviving) at a given time point. This risk, referred to as the hazard rate, is quantified for different time points throughout the study period through estimation of a statistical model, the choice of which is determined by the research question.
Similar to other modern approaches to longitudinal analysis, such as multilevel modeling for change, survival analysis allows variation across and within individuals in the frequency and timing of measurement occasions and/or data collection (Singer & Willett, 2003). However, unlike other regression-based longitudinal methods, the advantage of using survival analysis is that information from sample members that have not experienced the event of interest during the study period can be included in model estimation (Allison, 2010). These cases are referred to as censored. Two of the most common types of censoring are right censoring and left censoring. Right censoring indicates that the sample member has not yet experienced the event of interest by the end of the study period. Left censoring indicates that the sample member experienced the event prior to the start of the study period. Thus, the study sample need not exclude those for whom the occurrence and timing of the event of interest, such as reaching a criterion test score, is unknown.
Within educational contexts, survival analysis has been used to examine a broad range of issues and topics such as student mobility and grade retention (Finch et al., 2009), teacher turnover (Kelly, 2004; Vagi et al., 2019; Willett & Singer, 1991), university faculty retention and promotion (Box-Steffensmeier et al., 2015), and college completion rates (Chimka et al., 2007; Gury, 2011; Visser & Hanslo, 2005; Zwick & Sklar, 2005). More recently, survival analysis has been used to examine test performance. In this context, survival analysis can be used to estimate the likelihood that a test taker will achieve a particular score or proficiency level within a given time frame. For example, de Champlain et al. (2004) used the Cox Proportional Hazard model (Cox, 1972) to examine passing rates for Step 2 of the United States Medical Licensing Examination as a function of gender, location of medical school (United States/Canada vs. Other), and primary language (English vs. Other). Results indicated strong positive effects for United States/Canadian training and English as primary language, but little effect for gender.
In the context of proficiency testing of English as a foreign language (EFL) or English as a second language (ESL), there have been fewer studies, with most in the ESL context. For example, Conger (2009) used survival analysis to examine time to reaching minimal proficiency on an English-language test battery using data for four school entry cohorts of English-language learners aged 5–10 years. Results indicated that although approximately half the sample reached proficiency in three years, the younger entrants reached proficiency more quickly than older entrants, even after controlling for other factors including economic, demographic, disability, and school effects. More recently, Burke et al. (2016) examined emergent bilinguals’ time to reclassification as fluent English proficient using five years of statewide test data and found effects related to home language, socioeconomic status, and special education status. Information gleaned from such studies may be used to inform both curriculum and instruction, and educational policy as pertains to appropriate test use.
Given the educational importance of understanding factors that support English learners in attaining proficiency, this study examined the patterns of test performance in the EFL context. Specifically, this study addressed the following questions:
What are the test-taking patterns in time to achieving a designated criterion score level on the TOEFL iBT?
What test taker characteristics are associated with these patterns?
To address the research questions of interest, an extension of the Cox regression model (Cox, 1972; Cox & Oakes, 1984) was used to model time to achieving a designated criterion score level as a function of test taker background characteristics. The test taker background characteristics included in the study were age, gender, native language type (Indo-European [IE] or Non-Indo-European [NIE]), exposure to English (time studying English, time spent in a content class taught in English, time living in a country where English is the main language), reason for testing, and repeater status of the test taker. Answers to these research questions will add to the body of knowledge pertaining to timelines to reaching proficiency in English as a foreign language and the factors associated with test taker achievement patterns.
Method
The measurement instrument
As previously noted, this study examined performance on the TOEFL iBT developed by Educational Testing Service (ETS) to measure the ability of non-native speakers to use English at the university level. First launched in the United States on September 24, 2005, this test is taken by a very diverse population of test takers. As noted by ETS on the TOEFL website, “More than 35 million people from all over the world have taken the TOEFL® test to demonstrate their English-language proficiency” (https://www.ets.org/toefl/ibt/about/). In recent years, test takers represent approximately 230 native countries and 140 native languages (ETS, 2014, 2015, 2016, 2017, 2018b).
The TOEFL iBT test is designed to provide evidence of academic language proficiency in the four language modalities of reading, listening, speaking, and writing. Tasks in the reading and listening sections assess skills specific to the individual modality, whereas the speaking and writing sections contain some tasks that assess integrated skills across two-or-more modalities. The reading section includes academic reading passages with associated item sets. The listening section includes both academic lectures and conversations, with associated item sets. The speaking and writing sections include both independent and integrated tasks.
Scores for each section are reported on a scale that ranges from 0 to 30. A total test score is provided as a sum of the scale scores from each section and ranges from 0 to 120. Score reliabilities are 0.87 for both the Reading and Listening sections, 0.86 for the Speaking section, 0.80 for the Writing section, and 0.95 for the total test (ETS, 2018a).
Test taker background information
TOEFL iBT test takers are presented with a set of questions designed to collect information such as prior exposure to English within and outside of the classroom, and intended score use. The following four background information questions (BIQs), previously shown to be associated with EFL/ESL test performance (Manna & Yoo, 2015), were used in this study:
How much time have you spent studying English?
How much time have you attended a school, college, or university in which content classes (such as mathematics, history, or chemistry) were taught in English?
Have you lived in a country where English is the main language?
What is your reason for taking the TOEFL iBT test?
Data
For this study, we drew data from three years of TOEFL iBT test administrations (2014, 2015, and 2016). As the interest of the study was time to reaching a designated criterion score level from a fixed starting point, only those who tested for the first time during this period were included in the overall pool of test takers of interest in the study. Additionally, because an important focus of the study was on the test taker background characteristics that may be associated with differences in score achievement rates, only those with no missing data on the test taker characteristics of interest were included.
All test takers who met the screening criteria were retained for sample selection regardless of score level obtained at the end of the study period. The resulting screened data comprised the final pool of valid cases from which a random sample of 100,000 was drawn. All test records for the sampled test takers were used in the analyses, for a total of 129,107 test records.
Table 1 provides the scale score summary statistics by year for the analysis sample and the valid pool of test takers from which the sample was drawn. Also, included for comparison purposes is the overall pool from which the valid pool was obtained, as well as the total population of all test takers for that year, regardless of when they first tested (Educational Testing Service, 2014, 2015, 2016). A comparison of these summary statistics suggests that, on average, the overall pool is similar in performance to the population. However, as a result of data screening, the valid pool and analysis sample are somewhat higher performing than the general population of test takers for the 2014–2016 administrations.
Scale scores by year.
All who tested in the designated time period.
Population members who tested for the first time in the designated time period.
Overall pool members who met the screening criteria (i.e., first testing within study period and complete data on covariates of interest).
Table 2 shows the sample and valid pool response distributions for the four background questions used in this study. To support analysis with smaller groups, we collapsed several categories as indicated in the table notes. As can be seen, for each of the four questions, the sample and valid pool response distributions were very similar.
Background questions: response percentages over all testing occasions and at first testing.
Includes 2 response options: Less than 1 year; 1 year or more, but less than 2 years.
Includes 2 response options: 2 years or more, but less than 3 years; 3 years or more, but less than 5 years.
Includes 2 response options: Yes, for less than 6 months; Yes, for 6 months to 1 year.
Includes 3 response options: Yes, for more than 1 year but less than 2 years; Yes, for 2 years or more, but less than 3 years; Yes, for 3 years or more, but less than 5 years.
Table 3 provides summary information on test score and age at first testing, gender, native language type (IE or NIE), and repeater status for both the analysis sample and valid pool. As shown, they are very similar as pertains to the tabled characteristics. A review of Table 3 reveals that most tested only one time in the study period (approximately 82%), and the number of test takers decreased rapidly across repeated testing occasions.
Sample characteristics.
Number of days between consecutive test occasions. Sample range 12–1056 days. Valid pool range 12–1071 days.
Number of days from first test occasion. Sample range 12–1071 days. Valid pool range 12–1072 days.
Table 4 provides the top 10 most frequent native country and native language combinations at first testing for the analysis sample and valid pool. As can be seen, the ordering and relative percentages are similar for each of the native country/native language combinations.
Top 10 native country and native language combinations at first testing.
Analyses
To address the research questions of interest, we modeled time to achieving a target criterion score level using an extension of the Cox regression model (Cox, 1972; Cox & Oakes, 1984) that accommodates time-varying covariates (Allison, 2010; Singer & Willett, 2003). For this study, the criterion scores were based on performance levels articulated in the Common European Framework of Reference for Languages (CEFR, Council of Europe [2001]). The CEFR defines six levels of performance that range from A1 (lowest) to C2 (highest). TOEFL iBT test scores were mapped to the CEFR to provide cut scores, or the minimum test scores required for each of the corresponding CEFR levels measured by the TOEFL iBT, specifically A2, B1, B2, and C1 or Above (Papageorgiou et al., 2015). Two levels of criterion scores, representative of commonly used criteria for university admission, were examined in separate models: For the first model, achieving the target criterion score was defined as achieving a TOEFL total test scale score at the CEFR Level B2 or above (72); the second modeled meeting the total test scale score criterion for the Level C1 or above (95). Under this formulation, experiencing the event was represented by meeting the designated criterion, “survival” represented not meeting the criterion, and the time between testing occasions was measured in days. Note, for a given test taker, the event of meeting the criterion can occur repeatedly throughout the study period.
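The event definition above can be illustrated with a minimal sketch. It assumes each test taker's data arrive as (test date, total score) pairs; the function name `build_records`, the record field names, and the example scores and dates are hypothetical and are not taken from the study's data or code. How the first testing occasion (day 0) entered the fitted models is not specified here, so the sketch only derives the variables.

```python
from datetime import date

# Cut scores on the TOEFL iBT total score scale, as reported in the text.
CUT_SCORES = {"B2": 72, "C1": 95}

def build_records(tests):
    """Turn one test taker's (test_date, total_score) pairs into analysis records.

    Each record carries the testing order, the number of days since first
    testing (single time origin), and a 0/1 event flag per criterion level.
    """
    ordered = sorted(tests, key=lambda t: t[0])
    first_date = ordered[0][0]
    records = []
    for order, (test_date, score) in enumerate(ordered, start=1):
        record = {
            "testing_order": order,
            "days_since_first": (test_date - first_date).days,
            "score": score,
        }
        # The event occurs when the total score meets or exceeds the cut score.
        for level, cut in CUT_SCORES.items():
            record[f"event_{level}"] = int(score >= cut)
        records.append(record)
    return records

# Hypothetical test taker with three testing occasions.
example = build_records([
    (date(2014, 3, 1), 68),
    (date(2014, 5, 18), 74),
    (date(2015, 1, 10), 96),
])
```

In this illustration the test taker is still "surviving" with respect to both criteria after the first occasion, meets B2 at the second, and meets both B2 and C1 at the third, which shows how the event can recur for the same person.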
The Cox regression model is extremely flexible. It is one type of survival model that accommodates continuous time events, and can be used to model one-time (singular) or repeated events. Covariates may be continuous or categorical and may be constant as in the proportional hazards model or time varying as in model extensions such as the non-proportional hazards model used in this study.
The general function can be expressed as follows:
$$H(t_{ij}) = H_0(t_j)\,\exp\left(\beta_1 X_{1ij} + \beta_2 X_{2ij} + \cdots + \beta_P X_{Pij}\right)$$
where H(t_ij) represents the cumulative hazard (or odds) of experiencing the event for individual i at time t_j, that is, the risk that the event occurs at time t_j that has accumulated over time. X_1 through X_P represent a set of P generic covariates, with subscripts indicating covariate values specific to individual i at time j. If the covariate is constant, subscript j is not needed. H_0(t_j) provides the baseline function, that is, the cumulative hazard of experiencing the event at time t_j when all values of the covariates are equal to 0. In the case of repeated events, the model can be specified to estimate the hazard as a function of time since the last event or since the start of the study period (single origin). The research question will determine which is most appropriate.
In this study, the function represents the likelihood that a test taker achieves a designated criterion score level (B2 or C1) at time t. Because the focus was time to achieving a designated criterion score level with time starting at first testing, a single origin was specified.
The resulting regression coefficients (β_1, β_2, ..., β_P) quantify the effects of individual covariates on the likelihood of experiencing the event. Each coefficient is an estimate of the effect of a one-unit difference in the associated covariate, holding all other variables constant, and the sign of the regression coefficient indicates the direction of the effect. The value of the exponentiated coefficient, e^β, is referred to as the hazard ratio and provides the effect size of the covariate; that is, the effect of a one-unit difference in the associated covariate on the likelihood of experiencing the event. Thus, for a given covariate, a coefficient value greater than 0 (equivalent to a hazard ratio greater than 1) indicates a positive association with event probability, whereas a coefficient value less than 0 (a hazard ratio less than 1) indicates a negative association with event probability.
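As a small worked illustration of this interpretation, the sketch below converts between a coefficient, its hazard ratio, and a percent difference in hazard. The helper names are hypothetical. The hazard ratio of 1.222 is the value implied by the 22.2% figure reported for testing order at Level B2; the coefficient is back-calculated from it for illustration and is not read from the tables.

```python
import math

def hazard_ratio(beta):
    """Hazard ratio for a one-unit difference in a covariate: exp(beta)."""
    return math.exp(beta)

def percent_difference(hr):
    """Percent difference in hazard relative to the reference (HR = 1)."""
    return (hr - 1.0) * 100.0

def hazard_ratio_for_units(beta, units):
    """Hazard ratio for a difference of `units` in the covariate: exp(units * beta)."""
    return math.exp(units * beta)

# Hazard ratio implied by the reported 22.2% effect of testing order at B2.
hr_testing_order_b2 = 1.222
beta_testing_order_b2 = math.log(hr_testing_order_b2)  # back-calculated, positive

# A coefficient below 0 gives a hazard ratio below 1 (negative association).
hr_negative_example = hazard_ratio(-0.05)  # hypothetical coefficient
```

Because the model is multiplicative on the hazard scale, a two-unit difference corresponds to the squared hazard ratio rather than to twice the percent difference.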
Evaluating results
There are several aspects to evaluating results. This includes assessing model fit, global statistical significance of the model, and statistical significance of the regression coefficients.
Assessing model fit
Goodness of model fit is evaluated using three criteria: −2LL, AIC, and BIC. The Cox regression model is fit using a partial maximum likelihood method, and the log-likelihood (LL) statistic is used to assess model fit. (In the case of competing models, an increase in LL indicates relatively better fit.) Multiplying LL by −2 produces a statistic with a chi-square distribution under the null hypothesis that all parameters in the current model are 0.
When comparing nested models, a difference in the −2LL statistics has a chi-square distribution with degrees of freedom equal to the difference in number of parameters.
For comparison of two non-nested models, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), also known as the Schwarz Bayesian Criterion (SBC), are used, with better fit indicated by smaller values.
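The fit comparisons described above can be sketched as follows. All −2LL values and parameter counts in the example are hypothetical, chosen only to show the arithmetic, and the function names are not part of any package. The penalty term in SBC depends on the sample size n, and software differs in whether n is the number of records or the number of events, so n is left as an explicit argument here.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2,
    starting from the closed forms for df = 1 and df = 2.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-z), 1.0              # df = 2
    else:
        q, a = math.erfc(math.sqrt(z)), 0.5   # df = 1
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lr_test(neg2ll_reduced, neg2ll_full, df):
    """Likelihood-ratio test for nested models from their -2LL values."""
    stat = neg2ll_reduced - neg2ll_full
    return stat, chi2_sf(stat, df)

def aic(neg2ll, k):
    """Akaike Information Criterion: -2LL + 2k, with k estimated parameters."""
    return neg2ll + 2 * k

def sbc(neg2ll, k, n):
    """Schwarz Bayesian Criterion (BIC): -2LL + k * ln(n)."""
    return neg2ll + k * math.log(n)

# Hypothetical nested models that differ by 3 parameters.
stat, p_value = lr_test(neg2ll_reduced=1000.0, neg2ll_full=990.0, df=3)
```

Smaller AIC and SBC values indicate better fit after penalizing model size, whereas the likelihood-ratio test applies only when one model is nested in the other.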
Assessing global statistical significance of the model
Typically, three alternative tests are used to evaluate the overall significance of the model: the likelihood-ratio test, the Wald test, and the score (log-rank) test. Although the three methods are asymptotically equivalent, the likelihood-ratio test is more robust for small sample sizes and is therefore generally preferred.
Statistical significance of the regression coefficients
The Wald statistic is also used to test whether a coefficient is significantly different from 0.
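A minimal sketch of the Wald test for a single coefficient is given below, assuming the usual large-sample form in which the squared ratio of the estimate to its standard error is referred to a chi-square distribution with 1 degree of freedom. The function name and the example estimate and standard error are hypothetical.

```python
import math

def wald_test(beta, se):
    """Wald chi-square statistic and two-sided p-value for H0: beta = 0."""
    z = beta / se
    chi_sq = z * z
    # For 1 degree of freedom, P(chi-square > z**2) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return chi_sq, p_value

# Hypothetical estimate and robust standard error.
chi_sq, p_value = wald_test(beta=0.20, se=0.04)
significant = p_value < 0.001  # threshold used in the table notes
```

When a robust (sandwich) variance estimator is used, the standard error passed to such a test would be the robust one rather than the model-based one.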
Implementation
As noted above, survival analyses were conducted for two different designated score level criteria, defined by the total test scale scores that met or exceeded the TOEFL cut scores for CEFR levels B2 and C1, 72 and 95, respectively. For each criterion event, we examined two sets of covariates of interest in three successive models. The first model examined demographic covariates (gender, age, language type), the second model examined BIQs, and the third model examined both demographic variables and BIQs. Given that repeaters would have multiple records, an indicator of order over repeated testing occasions, Testing Order, was included in all models. Gender and language type were treated as fixed (time invariant), whereas testing order, age, and BIQ responses were treated as time varying. BIQs associated with time (i.e., Time studying English, Time in content class in English, Time in English language country) were treated as ordinal. For the categorical variable, BIQ Reason for taking TOEFL, the reference level was Reason 10 Other.
Survival analyses were conducted using SAS PROC PHREG (SAS Institute Inc., 2014), using the COVSANDWICH option that implements a robust variance estimator method to correct for dependence among the multiple observations nested within subject. To deal with the computational complexities posed by observations that have the same event times, the Efron (1977) method, which uses a numeric approximation to all possible orderings, was used for ties (Singer & Willett, 2003).
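To make the handling of tied event times concrete, the following is a simplified sketch of the Efron approximation to the partial log-likelihood for a single time-invariant covariate. It is not the PROC PHREG implementation: it omits time-varying covariates, repeated events, and the robust variance correction, and the function name and example data are hypothetical.

```python
import math
from collections import defaultdict

def efron_partial_loglik(times, events, x, beta):
    """Efron-approximated Cox partial log-likelihood for one covariate.

    times  : observed time for each subject
    events : 1 if the event occurred at that time, 0 if right-censored
    x      : covariate value for each subject
    beta   : coefficient at which to evaluate the log-likelihood
    """
    risk = [math.exp(beta * xi) for xi in x]

    # Group the indices of subjects who experienced the event by event time.
    tied = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e:
            tied[t].append(i)

    loglik = 0.0
    for t, idx in tied.items():
        d = len(idx)
        # Everyone still under observation at time t is in the risk set.
        risk_set_sum = sum(r for r, tj in zip(risk, times) if tj >= t)
        tied_sum = sum(risk[i] for i in idx)
        loglik += sum(beta * x[i] for i in idx)
        # Efron: remove a growing fraction of the tied subjects' risk
        # from the denominator for each successive tied event.
        for l in range(d):
            loglik -= math.log(risk_set_sum - (l / d) * tied_sum)
    return loglik

# Two subjects tied at time 1 and one event at time 2; at beta = 0 the
# result equals -log(3 * 2 * 1) = -log(6).
value_at_zero = efron_partial_loglik([1, 1, 2], [1, 1, 1], [0, 1, 0], beta=0.0)
```

With no tied event times the inner loop runs once per event and the expression reduces to the ordinary Cox partial log-likelihood.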
The main analyses were conducted using all sampled cases, including those who did not achieve the criterion score within the study period (right-censored). As noted earlier, data were screened to include only those who tested for the first time during the study period; thus, there were no sampled cases that might have achieved the criterion score prior to the start of the study period (left-censored).
Results
Results are provided in Tables 5–8. Table 5 provides the number tested at each repetition, the percentage of those tested who reached or exceeded the criterion score (CEFR Levels B2 and C1), and the median number of days from the first testing occasion, denoted as Mdn days duration, for those who met or exceeded the criterion score. As can be seen, the number of test takers decreased rapidly across repeated testing occasions with a consistently greater percentage scoring at or above the B2 level than at the C1 or above level when the number of test takers was greater than one. Moreover, upon repeated testing, the B2 level was typically achieved in less time than the C1 level. For example, among the 100,000 first-time test takers, 75.28% reached the B2 level or above at first testing, and 37.64% reached level C1. Among the 17,967 who tested a second time, 72.16% reached level B2 or above, with median time from first testing of 78 days, and 24.75% reached level C1 with median time from first testing of 92 days.
Sample repeat statistics.
Indicates order of test/retest occasions.
Indicates median number of days from first test occasion.
Model building results, CEFR Level B2 or above (total score 72+).
Pval < .001 considered statistically significant.
Gender coded Female = 1, Male = 0.
Native language type (LangType) coded Indo-European = 1, Non-Indo-European = 0.
TimeStudyEng through R9_Immigration are background questions shown in Table 2. For R1_2rySchool – R9_Immigration, the reference value is R10_Other.
−2LL for the B2 null model is 2,021,877.86.
Model building results, CEFR Level C1 (total score 95+).
Pval < .001 considered statistically significant.
Gender coded Female = 1, Male = 0.
Native language type (LangType) coded Indo-European = 1, Non-Indo-European = 0.
TimeStudyEng through R9_Immigration are background questions shown in Table 2. For R1_2rySchool – R9_Immigration, the reference value is R10_Other.
−2LL for the C1 null model is 968,700.10.
Final results, CEFR levels B2 and C1.
Pval < .001 considered statistically significant.
Gender coded Female = 1, Male = 0.
Native language type (LangType) coded Indo-European = 1, Non-Indo-European = 0.
TimeStudyEng through R9_Immigration are background questions shown in Table 2. For R1_2rySchool – R9_Immigration, the reference value is R10_Other.
−2LL for the B2 null model is 2,021,877.86. −2LL for the C1 null model is 968,700.10.
Tables 6 and 7 provide the results of sequential model fitting with the event of interest defined as reaching or exceeding CEFR Level B2 or CEFR Level C1, respectively.
To illustrate the model building process for CEFR Level B2, results for the three sequential models are provided in Table 6. As noted earlier, to account for repeated testing occasions, all three models included repeated testing order (Testing Order). Model 1 included the demographic variables age, gender (coded Female = 1, Male = 0), and native language type (LangType, coded Indo-European = 1, Non-Indo-European = 0). Model 2 included the four BIQs (time studying English, time spent in content classes taught in English, time spent in a country where English was the main language, and reason for taking TOEFL). Model 3 included both demographic variables and BIQs. As shown in the last row, the likelihood ratio test (LR) is statistically significant for all three models; thus, we reject the null hypothesis that all parameters in the model are equal to 0. A comparison of fit statistics (−2LL, AIC, and SBC) for the three models reveals a pattern of decrease across models, indicating increasingly better fit to the data. Review of model-building results for CEFR Level C1 provided in Table 7 indicates a similar pattern of model fit. Given these results, Model 3 was selected as the final model.
Table 8 presents a side-by-side comparison of the final models for Levels B2 and C1, as well as the corresponding estimated hazard ratios (HR) for each covariate.
To assist in interpreting the tabled results and to illustrate the similarities and differences across test levels, Figure 1 provides a visual comparison of the tabled hazard ratios associated with each covariate for the B2 and C1 levels. In this figure, the vertical axis represents the magnitude of the hazard ratios. As noted earlier, a hazard ratio greater than 1 indicates a positive association with event probability, whereas a value less than 1 indicates a negative association with event probability. Thus, relatively greater values indicate a relatively greater risk of experiencing the event (scoring at the B2 or C1 level).

Hazard ratios.
As can be seen, results differ somewhat by CEFR level. As noted previously in review of Tables 6 and 7, Model 3 is statistically significant at the global level for both B2 and C1. However, there is a notable difference in model fit, as indicated by the relatively smaller fit statistics for C1 in comparison to B2. This is the case for the null model as well as the covariate model (see table note). With respect to the covariates, a review of the parameter estimates and associated hazard ratios indicates more similarity than difference across levels as pertains to magnitude, direction, and level of statistical significance.
For example, for both B2 and C1, there is a statistically significant positive association with testing order. Note, however, that as reflected in the sample repeat statistics provided in Table 5, the relative magnitudes of the hazard ratios (HR) indicate that the effect is somewhat greater for B2: specifically, holding all other variables constant, repeat test takers have a 22.2% greater chance of scoring at the B2 level with each successive testing, in contrast to 18.7% for scoring at the C1 level.
For both B2 and C1, the variables with the largest positive associations for reaching the criterion score levels are language type, time studying English, and taking TOEFL for admission to non-business or business graduate and postgraduate programs (R4_Grad, Post NB; R5_Grad, Post B). As indicated by the relative magnitude of the hazard ratios, the effects of these covariates are somewhat greater at the C1 level. For example, for C1, native IE speakers have a 72.2% greater likelihood of scoring at the C1 level in comparison to NIEs, holding all other factors constant. In comparison, for the B2 level, native IE speakers have a 42.1% greater likelihood. The largest negative associations are taking TOEFL for admission to secondary school and 2-year or junior college (R1_2rySchool, R2_2yr, Jr Col).
As illustrated in Figure 1, there are a number of variables that differ in terms of the association with scoring at the B2 and C1 levels. For example, there is a small statistically significant negative effect for gender at the C1 level but not at the B2 level. Holding all other variables constant, females have a lesser probability of scoring at the C1 level than males; specifically, there is a difference of −4.5% for females. In contrast, the difference in the probability of scoring at the B2 level is less than 1% [0.2%] and is not statistically significant. The association with scoring at the B2 and C1 levels differed somewhat in terms of magnitude, direction, or statistical significance for four of the reasons for taking TOEFL: admission to English language school or program (R6_EngLangSchl), licensure or certification (R7_Licens,Cert), employment or job (R8_Employ,Job), and immigration (R9_Immigration).
To explore these results further, survival curves based on the estimated survival functions from the final Cox models are provided in Figures 2–5. Figures 2 and 3 provide the survival curves at the B2 and C1 levels, respectively, by language type for the sample’s typical test takers. The typical test taker is one who reflects the mean age, mean number of times testing, male–female proportion, and modal responses to the BIQs. In these curves, the vertical axis represents the probability of survival (i.e., not scoring at the designated criterion level, B2 or C1) and the horizontal axis represents time in days since first testing. The relative height and shape of each curve indicates the survival rate for a given group as a function of time, with lower points on the curves indicating a greater likelihood of experiencing the event (scoring at the designated criterion level), and higher points on the curves indicating a greater likelihood of surviving (not experiencing the event). It is worth noting that although the estimated curves appear to reach the x-axis, not everyone achieved the designated criterion levels.

Estimated survival curves by language type: Level B2.

Estimated survival curves by language type: Level C1.

Estimated survival curves by gender and language type: Level B2.

Estimated survival curves by gender and language type: Level C1.
A review of the curves reveals similarities and differences. At both levels, the likelihood of survival (i.e., not scoring at the B2 or C1 level), as represented by the relative heights of the curves, is greater for those test takers whose native language is non-Indo-European (NIE) in comparison to those whose native language is Indo-European (IE). As expected, in comparison to B2, the C1 curves for each language group reflect initial survival rates that are higher and decrease more slowly, indicating that holding all other variables constant, the typical test repeaters of both language groups are more likely to achieve a score at the B2 level and do so more quickly than at the C1 level.
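The relation between the hazard ratio for language type and the relative heights of the survival curves can be sketched as follows, using S(t) = exp(−H(t)) and a cumulative hazard that is scaled by the hazard ratio. The baseline cumulative hazard values and time points are hypothetical, standing in for a typical NIE test taker. The hazard ratio of 1.421 is the value implied by the 42.1% figure reported for language type at Level B2.

```python
import math

def survival_curve(cum_hazard, hr=1.0):
    """Survival probabilities S(t) = exp(-hr * H(t)) for a reference cumulative hazard."""
    return [math.exp(-hr * h) for h in cum_hazard]

# Hypothetical cumulative hazard for the reference (NIE) typical test taker.
days = [0, 30, 90, 180, 365]
cum_hazard_nie = [0.0, 0.2, 0.6, 1.0, 1.5]

# Hazard ratio implied by the reported 42.1% effect of language type at B2.
HR_IE_B2 = 1.421

surv_nie = survival_curve(cum_hazard_nie)
surv_ie = survival_curve(cum_hazard_nie, hr=HR_IE_B2)
```

Under these assumptions the IE curve equals the NIE curve raised to the power of the hazard ratio, so it lies at or below the NIE curve at every time point, which matches the ordering of the curves described above.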
Figures 4 and 5 provide the survival curves at the B2 and C1 levels, respectively, by language type and gender for the sample’s typical test taker, that is, one who reflects the mean age, mean number of times testing, and modal responses to the BIQs.
A review of the curves also reveals similarities and differences by level. Consistent with Figures 2 and 3, at both levels, the likelihood of survival (i.e., not scoring at the B2 or C1 level), as represented by the relative heights of the curves, is greater for those test takers whose native language is non-Indo-European (NIE) in comparison to those whose native language is Indo-European (IE).
At the B2 level, the curves for males and females coincide for both IE and NIE language types, indicating minimal, if any, gender difference in B2 level score achievement rates. In contrast, at the C1 level, the curves are higher for females (F) for both IE and NIE language groups, indicating that females are more likely to survive (less likely to achieve a score at the C1 level) than males (M). In comparison to B2, the C1 curves for males and females within each language group reflect initial survival rates that are higher and decrease more slowly, indicating that holding all other variables constant, the typical test takers of both language groups and genders are more likely to achieve a score at the B2 level and do so more quickly than at the C1 level.
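One way to read coinciding versus separated curves, assuming a standard proportional hazards form with a single gender coefficient (a simplification; the symbol \beta_F is introduced here for illustration and the extended Cox model fitted in the study may differ), is through the relation between the female and male survival functions with all other covariates held fixed:

```latex
S_F(t) = \left[ S_M(t) \right]^{\exp(\beta_F)}, \qquad 0 \le S_M(t) \le 1
```

Under this form, \beta_F = 0 gives identical curves, as observed at the B2 level, whereas \beta_F < 0 gives \exp(\beta_F) < 1 and therefore S_F(t) \ge S_M(t), that is, a higher female curve, as observed at the C1 level.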
Discussion
In this study, we used survival analysis to provide a glimpse into patterns of test performance for test takers who vary on a number of factors including number of times testing, age, gender, native language type, time spent with English (in study, in content classes taught in English, in a country where English is the main language), and reason for taking TOEFL iBT.
Results of this study indicate that the majority of the test takers in our sample (approximately 82%) tested one time in the study period and the number who retested decreased rapidly across repeated testing occasions. At first testing, approximately 75% scored at or above Level B2, and 38% scored at Level C1. For those who did not achieve the designated criterion scores at first testing, holding all other covariates constant, the likelihood of achievement increased with repeated testing, and the effect was somewhat greater for the B2 criterion than for C1. Moreover, with respect to time to reaching the designated criterion scores upon repeated testing, the B2 level was achieved in less time than the C1 level.
The covariates with the largest positive association with scoring at both the B2 and C1 levels are language type, time studying English, and Reasons 4 and 5, taking TOEFL for graduate or postgraduate admission (non-business or business program). The covariates with the largest negative association are Reasons 1 and 2, taking TOEFL for admission to secondary school or admission to a 2-year or junior college.
These results reflect the differential difficulty of the two criteria, and suggest that greater time is needed for improvement at the more stringent C1 level, and the required time may vary by test taker characteristics and reasons for testing. Although most of these results are not completely unexpected given the intended target population and test use, they may nevertheless provide useful information pertaining to factors related to achieving targeted levels of EFL proficiency at the total test level. A less expected result was that the association of gender differed across levels, with essentially no gender difference at the B2 level and females of both language types less likely to achieve the C1 level. This result warrants additional investigation.
Like most studies, this one had several limitations. Test taker background information was self-reported and as such was not complete for all records in the data. This information was an important focus of the study, and the sample was selected from those with complete information. Because it is quite possible that those with missing information differ somewhat from those with complete information, their exclusion may have affected the results. Investigation into the optimal approach for incorporating cases with missing information will be included in future research in order to increase sample inclusivity and thus the generalizability of study results.
In the context of this study, the decision to retest is self-determined and may be influenced by a number of factors, including the evaluation of achieved scores relative to targeted levels of proficiency as well as the availability of resources (educational, financial, cultural, contextual) to inform whether and when to retest and to provide support for retest opportunities. And while changes in retest scores may reflect changes in proficiency levels, it is also possible that a test taker reached a targeted level of proficiency between testing occasions. However, without an alternative measure of proficiency such as class grades, this would not be observed, owing to limitations associated with self-determined testing dates. To the extent that TOEFL iBT test takers may choose whether and when to retest, it is important to determine whether changes in performance across self-determined repeat testing reflect changes in the construct of measurement interest (English proficiency), an essential aspect of score validity, as well as the extent to which results would generalize across self- and externally-determined repeaters. To this end, future research may examine factors associated with retest decisions and their association with test performance.
With respect to educational implications, there are a number of additional factors associated with English language proficiency and test performance that were not available in the study data. These include but are not limited to indicators of time spent on test preparation, hours of additional study outside of language classes, type and intensity of language instruction, the timing and recency of instruction relative to first and subsequent testing occasions, and intended score use. Using these types of educational variables, future research is needed to examine the effects of what transpires between testing occasions to inform instruction and to support test takers in their quest to reach proficiency. Future research is also needed to investigate the extent to which effects may differ by language modality.
Additionally, given the diversity of the TOEFL test-taking population and the significance of language type in achieving the designated criteria, research is needed to examine time to proficiency and testing patterns within and across different cultural, economic, and/or educational settings. To this end, future research may focus on individual educational contexts, or may entail the use of hierarchical survival models to provide important insight into similarities and differences in factors associated with performance improvement within and across settings.
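As a sketch of what such a hierarchical survival model might look like, one common formulation (a shared frailty extension of the Cox model, offered here as an assumption and not as a specification proposed in this study) adds a setting-level random effect to the linear predictor, where i indexes test takers and j indexes settings:

```latex
h_{ij}(t \mid x_{ij}, u_j) = h_0(t)\,\exp\!\left( x_{ij}^{\top}\beta + u_j \right),
\qquad u_j \sim N(0, \sigma_u^2)
```

Here h_0(t) is the baseline hazard, \beta holds the covariate effects shared across settings, and the variance \sigma_u^2 indicates how much the hazard of reaching the criterion score varies between settings after adjusting for the covariates.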
Despite these limitations, this study contributes to a larger body of research on factors associated with achieving target levels of performance on tests of English as a foreign language, in general and as pertains to TOEFL iBT. From a methodological perspective, this study contributes to a growing corpus in the application of survival analysis to investigate test performance and, to the best of our knowledge, is the first to apply it to the international EFL context.
Supplemental Material
Supplemental material, Survival_Analysis_Study_Supplement_Sample_2_Tables for Time to achieving a designated criterion score level: A survival analysis study of test taker performance on the TOEFL iBT® test by Lora F. Monfils and Venessa F. Manna in Language Testing