Abstract

Test purpose: The primary purpose of the Test of English for International Communication (TOEIC®) is to measure the everyday English skills of individuals, who speak a first language other than English, working in an international environment (ETS, 2015a, 2016a; Powers & Powers, 2015). The TOEIC also has six secondary purposes: (1) to verify the current level of English language proficiency; (2) to qualify for a new position and/or promotion in a company; (3) to enhance professional credentials; (4) to monitor progress in English; (5) to set learning goals; and (6) to involve employers in advancing English ability (ETS, 2015a, 2016a, p. 2). Embedded in these purposes are five approved intended uses that are explicitly recommended for TOEIC scores: hiring, placing, promoting applicants, measuring English language proficiency, and evaluating progress in English (ETS, 2015a, 2016a).
Length, administration, and price: The TOEIC tests consist of the Listening and Reading test (TOEIC LR) and the Speaking and Writing tests (TOEIC SW). Test-takers have the option to take both the TOEIC LR and the TOEIC SW, the TOEIC LR with either the TOEIC Speaking or the TOEIC Writing, or just the TOEIC Speaking or TOEIC Writing individually (Liu & Costanzo, 2013). The TOEIC LR is a paper-and-pencil test, whereas the TOEIC SW tests are computer-based. The TOEIC LR takes two hours in total (45 minutes for the Listening section and 75 minutes for the Reading section). The TOEIC SW lasts approximately 20 minutes for the Speaking test and one hour for the Writing test. The TOEIC tests, available throughout the world, are designed by Educational Testing Service (ETS) in the United States and administered by ETS Preferred Network Associates around the world. Indicative costs of TOEIC tests administered by YBM Sisa, an ETS Preferred Network Associate in South Korea (hereafter, Korea), are approximately 40 US dollars for the TOEIC LR and 70 US dollars each for the TOEIC Speaking and the TOEIC Writing.
Scores: Scores for the TOEIC LR are each reported on a scale of 5–495, while scores for the TOEIC SW are each reported on a scale of 0–200. ETS provides English language proficiency level descriptors corresponding to ranges of the scaled scores for all TOEIC tests (see www.ets.org/toeic/). As the TOEIC LR is a norm-referenced test, test score reports for this section provide a percentile rank that helps test-takers identify their standing against the past three-year pool of test-takers. The report card also provides each test-taker’s percentage of correct answers corresponding to the intended test constructs and the average percentage of other test-takers’ correct responses in the same test sitting. All TOEIC LR and SW results are available in paper form; electronic/online reports are available in some countries (depending on the ETS Preferred Network (EPN) Associate).
General description
The first version of the TOEIC LR was developed by ETS in 1979 to evaluate employees’ and prospective employees’ English skills in international business contexts in response to a request from the Japanese Ministry of International Trade and Industry (MITI) (ETS, 2013). A revised version of the TOEIC LR was released in 2006. In the same year, the TOEIC SW tests were introduced to address test users’ concerns that some employees lacked English speaking and writing skills despite their high scores in the TOEIC LR (Chapman & Newfields, 2008; ETS, 2010; Powers, Kim, & Weng, 2010; Powers & Powers, 2015). Both the revised TOEIC LR and the introduction of the TOEIC SW were carried out according to the principles of Evidence-Centered Design (ECD) (see Mislevy, Steinberg, & Almond, 2002, 2003). ECD provides steps for collecting evidence regarding what is being measured to support claims of test-takers’ abilities, based on their test performance.
The popularity of the TOEIC tests has been evident for over 30 years, and recently demand has grown dramatically. The TOEIC LR is now used by English language learning programs and government agencies in over 160 countries, as well as by roughly 14,000 companies around the world (ETS, 2016c, para. 4). Furthermore, although the TOEIC SW tests are much less popular than the TOEIC LR, the number of people in Japan taking the former has radically increased from 1,200 in 2006 to 24,000 in 2014 (Institute for International Business Communication, 2014). In Korea, the number of test-takers for the TOEIC Speaking has soared dramatically from 15,000 in 2007 to 260,000 in 2012 (Korea TOEIC Council, 2013).
In November 2015, ETS announced further revisions to the TOEIC LR format, although the test retained the same score scale, level of difficulty, total number of items, and length. The most updated version of the TOEIC LR was released to Japan and Korea in May 2016, and was planned to be rolled out elsewhere throughout 2017. The following sections review both this updated version (released in 2016) and the current form of the TOEIC SW.
Listening and reading
The updated version of the TOEIC LR consists of seven parts: four for the Listening section and three for the Reading section (see Table 1). The Listening and Reading sections each include 100 multiple-choice questions (MCQs), and each section is scored up to a maximum of 495 points. The Listening section is designed to measure listening for understanding of details and for inferring gist, purpose, and basic context. The Reading section includes items targeting grammar, vocabulary, understanding of specific information, and connecting and synthesizing information.
Table 1. Description of each question in TOEIC Listening and Reading (ETS, 2017).
Although monologic prompts are provided in Parts 1, 2, and 4, more than three interlocutors who have different English accent varieties (American, British, Canadian, Australian, or New Zealand) are included in Part 3 conversations. Furthermore, colloquial forms, such as gonna for going to, and fragments of full sentences, such as down the hall or could you?, appear in some conversations in the Listening section (ETS, 2015c, November 5). In the Reading section, text messages and online chat dialogues reflect the types of communication that are now commonly used in global workplaces (ETS, 2015b).
Speaking
The TOEIC Speaking consists of 11 items grouped into six different tasks. These items are sequenced in increasing levels of complexity and difficulty (see Table 2). The Speaking test is designed to assess the following three claims: (1) whether test-takers can produce speech intelligible to native and proficient non-native speakers of English; (2) whether they can produce appropriate language for routine interactions; and (3) whether they can create connected discourse appropriate for the workplace (ETS, 2013, 2016a; Hines, 2010). A scale of 0 to 3 is used for Questions 1 to 9 and a scale of 0 to 5 for Questions 10 and 11 (Liao & Wei, 2010).
Writing
The TOEIC Writing has eight items organized into three tasks (see Table 3). These tasks assess the following three claims: (1) whether test-takers can produce both simple and complex sentences; (2) whether they can generate multi-sentence-length text; and (3) whether they can produce multi-paragraph-length text supported by reasons, evidence, and explanations (ETS, 2013, 2016a; Hines, 2010). As with the TOEIC Speaking, the three tasks are rated on different scales: a scale of 0 to 3 for the first task; a scale of 0 to 4 for the second task; and a scale of 0 to 5 for the third task (ETS, 2016a).
Overall evaluation
The appraisal of the TOEIC tests is based on contemporary practices in test validation, that is, an argument-based approach to validation, specifically Kane’s (2013) interpretation and use argument (IUA). Kane’s (2013) IUA is based on Toulmin’s (1958, 2003) argument model and provides a practical guideline for test validation by evaluating how well claims based on test scores are supported or challenged by the evidence collected during validation. In the following sections, the TOEIC tests are evaluated analytically under four categories of test qualities and practices: scoring, generalization, extrapolation, and decision rules. Each section begins with an explanation of the quality in question, followed by an evaluation of the TOEIC tests against it.
Scoring
Scoring in Kane’s (2013) IUA pertains to the adequacy of a scoring rubric and scoring procedures. Judgments of experts who were involved in designing and developing a scoring rubric as well as empirical evidence for the consistency and accuracy of scoring procedures generally can be provided as evidence for the claim of scoring.
To enhance the accuracy and consistency of scoring, supporting evidence has mainly been collected through statistical analyses across TOEIC administrations. For example, appropriate item difficulties and effective discrimination between high- and low-proficiency test-takers have been reported in the revised version of the TOEIC LR released in 2006 (see Liao, Hatrak, & Yu, 2010).
In addition to the evidence for the TOEIC LR, scoring rubrics for the TOEIC SW, including their scales, were developed by the TOEIC design team and further revised with input from raters so that TOEIC SW scores would be accurate, consistent, and in accordance with the claims based on the ECD approach (Hines, 2010). One notable effort by ETS is the use of the “Online Scoring Network” (ETS, 2010, p. 6) for the TOEIC SW to enhance the accuracy and consistency of the raters’ scoring. In this scoring network, each rater is only allowed to rate the responses of each test-taker on one type of question at a time. In addition, a rater who has already scored the maximum permitted number of questions from a test-taker is prevented from scoring further questions from that test-taker (Everson & Hines, 2010). This procedure ensures that each test-taker’s responses are scored by multiple raters, enhancing overall reliability and fairness (Everson & Hines, 2010). Analysis of scoring data between September 2012 and January 2013 shows that, under these systematic scoring procedures, the average agreement rates for scoring consistency of each item in the TOEIC SW are mostly over 99% (Qu & Ricker-Pedley, 2013).
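To make the notion of a rater agreement rate concrete, the following is a minimal sketch of how such a rate could be computed for one item scored by two raters. The ratings are hypothetical, and the function name and the `tolerance` parameter are assumptions introduced for illustration; the source does not state whether the reported rates count exact agreement only or also adjacent (within one point) agreement, so both are shown.

```python
def agreement_rates(first_ratings, second_ratings, tolerance=1):
    """Return (exact, within_tolerance) agreement proportions for two raters.

    Illustrative helper only; not the procedure used by ETS.
    """
    if len(first_ratings) != len(second_ratings) or not first_ratings:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(first_ratings)
    # Absolute difference between the two raters' scores for each response
    diffs = [abs(a - b) for a, b in zip(first_ratings, second_ratings)]
    exact = sum(d == 0 for d in diffs) / n
    within = sum(d <= tolerance for d in diffs) / n
    return exact, within


# Hypothetical ratings of ten responses on a 0-3 scale
rater_a = [3, 2, 2, 1, 3, 0, 2, 3, 1, 2]
rater_b = [3, 2, 1, 1, 3, 0, 2, 2, 1, 2]

exact, within = agreement_rates(rater_a, rater_b)
print(f"exact: {exact:.0%}, exact or adjacent: {within:.0%}")
```

In this invented example the two raters agree exactly on eight of ten responses and never differ by more than one point, which illustrates why the two definitions can yield quite different rates.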
ETS provides adequate evidence to support the validity of scoring. However, test users may expect further studies on the item difficulty and discrimination of the updated version of the TOEIC LR, as items have been significantly revised.
Generalization
Generalization refers to estimates of consistency of test scores over test tasks and testing contexts (Kane, 2013). ETS has provided some reports regarding the internal consistency of TOEIC scores from a single administration as well as the reliability of the scores from multiple administrations (i.e., test–retest reliability) to support the claim of generalization. The most recently reported internal consistency of the revised version of the TOEIC LR combined was approximately .90 (KR-20) and the standard error of measurement (SEM) was computed as ±25 scaled score points (ETS, 2013). The internal consistency of individual questions within each of the mentioned three claims in the TOEIC Speaking was reported as .80 (ETS, 2010). However, the internal consistency was not calculated for the TOEIC Writing because there was only one question (i.e., question 8) for the third claim (ETS, 2010).
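For readers less familiar with these indices, the sketch below shows how KR-20 and the SEM are conventionally computed. The response matrix is invented, and the standard deviation of 79 scaled score points is a hypothetical value chosen only to show how a reliability of .90 relates to an SEM of roughly 25 points; it is not a figure reported in the source.

```python
from math import sqrt
from statistics import pvariance


def kr20(responses):
    """KR-20 for dichotomous (0/1) items.

    responses: one list of 0/1 item scores per test-taker.
    KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / variance of total scores)
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    total_var = pvariance(totals)  # population variance, consistent with p*q
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return score_sd * sqrt(1 - reliability)


# Hypothetical responses: five test-takers, four items
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20: {kr20(data):.2f}")

# Hypothetical SD of 79 scaled points (an assumption, not from the source)
print(f"SEM: {standard_error_of_measurement(79.0, 0.90):.1f}")
```

The second calculation shows that an SEM of about 25 points with a reliability of .90 would correspond to a score standard deviation of roughly 79 points under this formula.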
Regarding the test–retest reliability of TOEIC scores, raw scores of the TOEIC LR are converted into scaled scores through equating in order to be comparable across different test administrations (ETS, 2016b). The test–retest reliability of scores of the TOEIC Speaking was reported to be approximately .80, based on the data of 16,867 test-takers, and the test–retest reliability of TOEIC Writing scores was .82 based on the data of 6,199 test-takers (Liao & Qu, 2010). It should be noted that the test–retest reliability for the TOEIC SW was based on data collected from two different forms of the TOEIC SW, administered between December 2006 and December 2008 in Korea (Liao & Qu, 2010).
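A test–retest (here, alternate-forms) reliability coefficient of this kind is typically a Pearson correlation between the scores that the same test-takers obtain on two occasions. The sketch below assumes that interpretation; the scores are hypothetical and the function name is introduced for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical Speaking scores (0-200 scale) for five test-takers on two forms
first_form = [110, 130, 150, 160, 180]
second_form = [120, 120, 160, 150, 190]

print(f"test-retest r = {pearson_r(first_form, second_form):.2f}")
```

Because the two administrations used different forms, a coefficient computed in this way reflects both stability over time and the equivalence of the forms.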
ETS has put much effort into ensuring the internal consistency and reliability of TOEIC scores to provide test users with a reliable measure to help them make valid decisions about test-takers. However, the available data, when analyzed, were found to be primarily based on two testing contexts, namely Korea and Japan. The reason for this finding may be that those two countries are the main consumers of the TOEIC. Evidence for consistent TOEIC scores across different testing contexts (e.g., Brazil and Taiwan) would be a useful addition to the body of TOEIC research.
Extrapolation
Extrapolation pertains to the extent to which test scores predict performance in the target domain (Kane, 2013). ETS has collected validity evidence for extrapolation, primarily from Korea and Japan, using test-takers’ self-assessments of their language skills in performing a variety of English communication tasks (ETS, 2010, 2013; Powers, Kim, Yu, Weng, & VanWinkle, 2010; Powers & Powers, 2015). With large samples of test-takers (e.g., 6000 for the TOEIC LR, 3518 for the TOEIC SW, and 2300 for the four skills) in the 2006, 2008, and 2015 studies, the correlations between self-reported responses on each skill and related TOEIC scores ranged from the .40s to the .50s, which are regarded as relatively strong in social science research (Cohen, 1988). However, it remains unknown whether findings based on test-takers’ self-assessments from Korea and Japan can be generalized to other contexts (e.g., Brazil and Taiwan) where TOEIC scores are also widely used. Further, observations and evaluations of potential and current employees’ performance in actual business contexts by other stakeholders, such as employers, would provide additional evidence to supplement the test-takers’ self-assessments.
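As a rough aid to interpreting correlations of this size, the sketch below applies Cohen’s (1988) conventional benchmarks for r (.10 small, .30 medium, .50 large) and converts r into the proportion of shared variance (r squared). The function names are assumptions for illustration, and the benchmarks are conventions rather than strict cut-offs.

```python
def cohen_label(r):
    """Label a correlation using Cohen's (1988) conventional benchmarks."""
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


def shared_variance(r):
    """Proportion of variance shared by two measures (r squared)."""
    return r ** 2


# Illustrative values spanning the range reported in the text
for r in (0.40, 0.45, 0.50, 0.55):
    print(f"r = {r:.2f}: {cohen_label(r)}, "
          f"shared variance = {shared_variance(r):.0%}")
```

On these conventions, correlations in the .40s to .50s fall between the medium and large benchmarks and imply that roughly 16% to 30% of the variance in self-assessments is shared with TOEIC scores.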
A further concern is that the TOEIC Speaking may not capture certain aspects of communication that are vital in international workplaces. Some empirical studies on English communication demonstrate that interactive communication skills (e.g., meaning negotiation, accommodation, and repair) are vital in international business workplaces (Kankaanranta & Louhiala-Salminen, 2013; Louhiala-Salminen, Charles, & Kankaanranta, 2005; Louhiala-Salminen & Kankaanranta, 2011). However, given that the TOEIC Speaking is delivered in a single-candidate format through a computer, it may not adequately measure these communication skills. There has been little research so far on how to engage a wider range of communication skills within the TOEIC Speaking, or on how well TOEIC scores reflect actual communication performance in the target domain. These questions should form an important agenda for future TOEIC research. The fitness of a language test for its target context is a fundamental consideration in test validation (Elder & Harding, 2008). It would therefore be worthwhile to consider including features of English that are used in international business workplaces in the TOEIC, in order to ensure more valid interpretations and decisions about applicants’ English abilities in international business contexts.
Decision rules
Decision rules are evaluated based on “overall consequences (or utilities or values)” (Kane, 2013, p. 47) in accordance with test purposes. ETS has undertaken to fulfill the purposes of the TOEIC to bring about beneficial outcomes for all stakeholders (ETS, n.d.). To achieve its intended outcomes of the TOEIC, ETS carried out differential item functioning (DIF) analysis to detect any items or ratings that may exhibit bias for the TOEIC LR (e.g., Yoo & Manna, 2017) and the TOEIC SW (e.g., Qu & Ricker-Pedley, 2013). However, there may be a need for clearer communication around the distinction between the six secondary purposes and the five approved intended uses because test users may not grasp this distinction clearly. For example, one of the six secondary purposes of the TOEIC is for test takers to qualify for employment. This purpose may have led to unapproved uses in Taiwan and Korea. Taiwanese universities and colleges employ TOEIC LR scores as a graduation requirement to help their students prepare to qualify for better employment opportunities after graduation (Hsieh, 2017). Similarly, almost half of Korean universities (i.e., 99 out of 220 universities as of 2013) used TOEIC LR scores for university admissions with a rationale regarding employment prospects after graduation (Im & McNamara, 2017). However, these uses are not specified as approved uses of the TOEIC LR. In addition, it is important that separate validation arguments be developed for each independent purpose and/or use (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Kane, 2013); the inferences from the results of a single test may fit certain purposes, but not others (Newton, 2010).
However, the number of empirical studies on the impact and washback of TOEIC use is limited (see Booth, 2012; Choi, 2008) despite the widespread use of the TOEIC, particularly in Korea and Japan. On the one hand, the TOEIC has been found to encourage practice and stimulate learning (Jung, 2010), and to motivate students to improve their English language proficiency (Newfields, 2005). On the other hand, the multiple-choice format has been shown to lead to negative washback in the Korean classroom (Choi, 2008), and to lead to a focus on test preparation strategies in Japan (Newfields, 2005). Further evidence for washback of the TOEIC should be collected on a continuous basis across various contexts with multiple stakeholders, especially considering the recommendations for multiple uses of the TOEIC.
Conclusion
The Test of English for International Communication (TOEIC) has been successfully delivered for more than 30 years. The test serves the primary purpose of measuring the everyday English skills of individuals, who speak a first language other than English, working in an international environment (ETS, 2015a, 2016a). This purpose has been well achieved by using a very sophisticated method of domain analysis (i.e., the ECD approach) and by providing consistent test results across administrations of the TOEIC. An ongoing challenge is the growing need to expand TOEIC constructs to fit the real-world language demands of international workplace contexts. In addition, the distinction between the six secondary purposes and the five approved intended uses should be communicated more clearly, and separate validation arguments should be developed for each independent use (AERA, APA, & NCME, 2014).
Footnotes
Acknowledgements
We express our gratitude to Dr. Luke Harding, the editor of Language Testing, and we thank three research scientists at Educational Testing Service for fact-checking this review.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
