Abstract
Elicited imitation (EI) has been widely used to examine second language (L2) proficiency and development and was an especially popular method in the 1970s and early 1980s. However, as the field embraced more communicative approaches to both instruction and assessment, the use of EI diminished, and the construct-related validity of EI scores as a representation of language proficiency was called into question. Current uses of EI, while not discounting the importance of communicative activities and assessments, tend to focus on the importance of processing and automaticity. This study presents a systematic review of EI in an effort to clarify the construct and usefulness of EI tasks in L2 research.
The review underwent two phases: a narrative review and a meta-analysis. We surveyed 76 theoretical and empirical studies from 1970 to 2014, to investigate the use of EI in particular with respect to the research/assessment context and task features. The results of the narrative review provided a theoretical basis for the meta-analysis. The meta-analysis utilized 24 independent effect sizes based on 1089 participants obtained from 21 studies. To investigate evidence of construct-related validity for EI, we examined the following: (1) the ability of EI scores to distinguish speakers across proficiency levels; (2) correlations between scores on EI and other measures of language proficiency; and (3) key task features that moderate the sensitivity of EI.
Results of the review demonstrate that EI tasks vary greatly in terms of task features; however, EI tasks in general have a strong ability to discriminate between speakers across proficiency levels (Hedges’ g = 1.34). Additionally, construct, sentence length, and scoring method were identified as moderators for the sensitivity of EI. Findings of this study provide supportive construct-related validity evidence for EI as a measure of L2 proficiency and inform appropriate EI task development and administration in L2 research and assessment.
Elicited imitation (EI) is a testing method that usually requires participants to listen to a series of stimulus sentences (or phrases, words, sounds) and then repeat the sentences verbatim (Underhill, 1987). When used to assess language performance, EI allows developers and researchers to customize the target component of language proficiency and the difficulty of the tasks by varying the sentence stimuli (Hood & Schieffelin, 1978). The simplicity and flexibility in task development and administration make EI adaptive to both classroom and standardized assessments and a valuable tool in exploratory research settings (e.g., Henning, 1983; Markman, Spilka, & Tucker, 1975; van Moere, 2012).
EI has been widely used to investigate first language (L1) development (e.g., Fraser, Bellugi, & Brown, 1963; Slobin & Welsh, 1973 for reviews of the use of EI in L1 acquisition), language disorders in children (e.g., Dailey & Boxx, 1979), and neuropsychological activities (e.g., Menyuk, 1964). The underlying assumption of testing with EI is that if the participant has acquired the grammatical features associated with or displayed in the stimuli, it should be easy to repeat the stimuli. Otherwise, repetition will be difficult (Rebuschat & Mackey, 2013).
The simple and flexible characteristics of EI have led to wide variation in its task design (Jessop, Suzuki, & Tomita, 2007); however, task variation, in turn, presents challenges in applying findings of EI studies to enhance L2 learning, teaching, and assessment. The present systematic review contributes to the appropriate use of EI for these purposes. In line with Norris and Ortega’s (2006) recommendation of synthesizing available language tasks in the field of language learning and teaching, this systematic review aimed to do the following: (1) summarize the historical and current state of the development, administration, and use of EI tasks; (2) clarify the construct measured by EI; and, more importantly, (3) advance discussions towards a more principled practice of EI in L2 research.
Prior to our systematic review, Zhou (2012) reported a synthesis of 24 studies using EI on L2 adult learners and concluded that EI is overall a reliable measure (internal consistency coefficient ranged from .78 to .96, p. 90). In addition, the correlation between EI scores and other measures of language proficiency was higher than .5 in the majority of the studies reviewed (p. 90), which provides some support for the construct-related validity for EI as a measure of language proficiency.
The present review expanded Zhou’s study and employed a two-phase approach by featuring both a systematic narrative review (Phase I henceforth) and a quantitative meta-analysis (Phase II henceforth). The two types of syntheses are complementary and, when combined, offer a more comprehensive account of the subject under investigation than either approach alone (for the definitions and differences between the two, see Ellis, 2015). More specifically, in Phase I, we surveyed the use of EI in L2 research from 76 published and unpublished studies (including all the 24 studies synthesized in Zhou [2012]) in the period of 1970–2014. The survey resulted in a historical review of (changes in) the use of EI alongside the development of theoretical models of second language proficiency and shifts in validity frameworks. In addition, the historical review outlines the debate on the authenticity of EI and the construct it measures, serving as the theoretical basis for the quantitative meta-analysis in Phase II.
The meta-analysis in Phase II examined the construct-related validity of EI (i.e., what EI measures), a theoretical question that has been debated over time among language testers. Specifically, using a subsample of studies collected for Phase I, we explored whether scores on EI tasks can distinguish speakers across proficiency levels (hereafter referred to as sensitivity of EI), in addition to the typical correlations between scores on EI and other measures of language proficiency. If EI is an effective measure of language proficiency, EI scores of higher and lower proficiency speakers should be consistently distinguishable across studies. The inability of EI tasks to distinguish speakers across proficiency levels would indicate that EI might be measuring something different. Finally, we examined the variation in EI task design across studies and its impact on the sensitivity of EI because there have not been established standards or protocols for the use of EI in L2 research. The results of these two phases were then integrated in a discussion that we hope may enable more principled practices for the use of EI to measure L2 proficiency.
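To make the notion of sensitivity concrete, the sketch below computes a standardized mean difference between a higher and a lower proficiency group, with the small-sample correction that yields Hedges' g. The formula is the standard textbook form and is an assumption here, since this section does not state the exact computation used in Phase II; the function name and the group statistics are hypothetical.

```python
import math

def hedges_g(mean_hi, sd_hi, n_hi, mean_lo, sd_lo, n_lo):
    """Standardized mean difference with small-sample correction.

    Assumes two independent groups and the common approximation
    J = 1 - 3 / (4 * df - 1), where df = n_hi + n_lo - 2.
    """
    df = n_hi + n_lo - 2
    if df <= 0:
        raise ValueError("need at least three participants in total")
    # Pooled standard deviation across the two proficiency groups.
    pooled_sd = math.sqrt(((n_hi - 1) * sd_hi ** 2 + (n_lo - 1) * sd_lo ** 2) / df)
    d = (mean_hi - mean_lo) / pooled_sd   # Cohen's d
    correction = 1 - 3 / (4 * df - 1)     # small-sample bias correction
    return d * correction

# Hypothetical EI scores: advanced group vs. beginner group.
g = hedges_g(mean_hi=30.0, sd_hi=8.0, n_hi=20, mean_lo=20.0, sd_lo=8.0, n_lo=20)
print(round(g, 3))
```

A larger g indicates that EI scores separate the two groups more clearly; a value near zero would suggest that the task does not distinguish proficiency levels.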
Phase I: Narrative review
Phase I of this study was guided by two specific research questions: (1) In what research/assessment context is EI used in L2 research (e.g., target construct, language, and language proficiency levels)?; and (2) How does the design of certain key features of EI tasks vary across studies?
Method
Study selection criteria
The narrative review (Phase I) included all the studies in the period of 1970–2014 (May 2014) that discussed (at length) or used EI as a method to measure global or specific aspects of L2 proficiency. Reports that only mentioned EI but did not discuss the technique in detail were excluded from this phase (e.g., Rebuschat, 2013). The research synthesis began with Naiman (1974), the first documented application of EI in L2 research to measure linguistic competence of young L2 learners of French. The type of documents collected included journal articles, book chapters, dissertations and theses, conference proceedings, technical reports, and book reviews.
Identification of studies
The following steps were taken to locate related EI studies. First, a list of commonly used electronic databases in the fields of applied linguistics and education was used to search for studies that fit the selection criteria mentioned above. These databases include Academic Search Premier, Education Source, ERIC, EBSCO, Google, Google Scholar, JSTOR, IRIS, LLBA (Linguistics and Language Behavior Abstracts), ProQuest Dissertations and Theses, PsycARTICLES, PsycINFO, SSCI (Social Sciences Citation Index), and ScienceDirect databases. Keywords used to search for studies were the combinations of two phrases: (a) “elicited imitation” or “sentence repetition” or “sentence recall” or “imitation” or “repetition”, and (b) “second language” or “foreign language”.
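The keyword combinations described above can be enumerated mechanically. The following is a minimal sketch that pairs each phrase in set (a) with each phrase in set (b); the quoted `AND` syntax is an assumption for illustration, since query syntax differs across databases and the exact strings used in the search are not reported.

```python
from itertools import product

# Phrase sets (a) and (b) as listed in the text.
TASK_TERMS = [
    "elicited imitation",
    "sentence repetition",
    "sentence recall",
    "imitation",
    "repetition",
]
CONTEXT_TERMS = ["second language", "foreign language"]

def build_queries(task_terms, context_terms):
    """Return one quoted AND-query per (task term, context term) pair."""
    return [f'"{a}" AND "{b}"' for a, b in product(task_terms, context_terms)]

queries = build_queries(TASK_TERMS, CONTEXT_TERMS)
for query in queries:
    print(query)
```

With five task phrases and two context phrases, the procedure yields ten keyword pairs per database.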
Second, both electronic and manual searches were performed for some widely cited journals in applied linguistics and second language acquisition (SLA), including Applied Linguistics, Applied Psycholinguistics, CALICO Journal, Computer Assisted Language Learning, Foreign Language Annals, Language Assessment Quarterly, Language Learning, Language Teaching Research, Language Testing, Modern Language Journal, and TESOL Quarterly. Finally, reference lists of the identified reports were used to locate any additional studies related to our synthesis.
The literature search process identified 76 studies that either define or use EI in a way aligned with our definition of EI. These studies include 52 journal articles, 12 dissertations, five conference proceedings, five book chapters, and two book reviews. All 76 studies were used for the narrative review (see Figure 1 for a summary of the selection criteria and search process).

Figure 1. Selection and classification of EI studies.
Coding
The coding process of the primary studies consisted of two stages. First, a preliminary set of coding variables was identified, based on the reviews of EI in Bley-Vroman and Chaudron (1994), Gallimore and Tharp (1981), and Vinther (2002), to represent the research/assessment context (i.e., target language, measured construct, target language proficiency levels) and task features of EI (i.e., number of items, stimuli sentence length, implementation of delayed repetition, scoring method, control of linguistic variables). Based on these variables, a coding scheme was developed and then piloted on a sample of articles from the 76 studies. The codes were then discussed among the authors of the study and unclear codes were revised. The coding scheme was finalized after three rounds of trial coding, discussion, and revision. The specific codes for each category of the final coding scheme for this synthesis are presented in Table 1.
Table 1. Features of EI tasks coded for Phase I (k = 58*).
* Articles that had only theoretical discussions were excluded from this table.
Other refers to studies describing language proficiency levels in an institutional approach.
Once the coding scheme was established, it was then made available online to the coders through Qualtrics©, a survey distribution program. All the primary studies (k = 76) were coded independently by the first and third authors of the study. The inter-coder reliability expressed in the percentage of agreement was 94.68%, with close to 100% agreement on most EI task features but 86.42% on control of linguistic features. Cohen’s Kappa coefficient for inter-coder agreement on control of linguistic features was 84.11%. Discrepancies between the two coders were identified and discussed, and final agreement reached 100%.
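For readers less familiar with the two agreement indices reported above, the sketch below computes percentage agreement and Cohen's kappa for two coders on a single categorical variable. The codes shown are invented for illustration and are not the study's data; the function names are likewise hypothetical.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Share of items on which the two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must code the same non-empty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Expected chance agreement from each coder's marginal proportions.
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both coders used a single identical code
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for "control of linguistic features" on eight studies.
coder_a = ["syntax", "syntax", "lexis", "none", "syntax", "lexis", "none", "syntax"]
coder_b = ["syntax", "syntax", "lexis", "none", "lexis", "lexis", "none", "syntax"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement whenever some matches would be expected by chance alone, which is why the two indices are usually reported together.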
Results
Research and assessment contexts of EI studies
The use of EI as an L2 proficiency measure has extended across a variety of languages and linguistic constructs, targeting a range of proficiency levels. While the majority of the studies used EI to measure performance in English (k = 34), other target languages included French (k = 7), Spanish (k = 7), Dutch (k = 3), Mandarin (k = 3), German (k = 2), and Japanese (k = 2). It should be noted that there is a lack of standardization in the characterization of participant language proficiency level in studies published in SLA journals (Thomas, 1994, 2006). The two most commonly used approaches are institutional (i.e., grouping based on their assigned curricular or course levels) and impressionistic (i.e., grouping based on impressionistic descriptors, e.g., beginner, intermediate, or advanced). When grouped in the institutional approach, language proficiency levels can vary even among L2 learners within the same course or program level (Tremblay, 2011). That is, students from a higher level course are not necessarily higher in terms of language proficiency than students from a lower level course, especially in the case of heritage speakers. Given this limitation, this study uses the impressionistic approach; however, studies describing language proficiency levels in an institutional approach were retained and coded as Other on the target language proficiency level (see Table 1). Three levels of L2 proficiency were specified to classify participants in this review: high (advanced), intermediate, and low (beginner). The frequency counts shown in Table 1 are evenly distributed across all three language proficiency levels. When distinguishing L2 speakers across proficiency levels, some studies included native speakers as a baseline for comparison (e.g., Erlam, 2006), while others compared EI scores of L2 speakers across proficiency levels (e.g., West, 2012; Wu & Ortega, 2013).
Among the 76 primary studies, 18 studies focused on theoretical discussions of EI with respect to the measured construct and task design (hereafter referred to as theoretical discussion studies; see Figure 1). The other 58 studies used EI tasks to measure a variety of language-related constructs in both experimental and non-experimental settings.
There were 13 quasi-experimental studies that used EI as a learning outcome measure, testing the effect of particular interventions (hereafter referred to as quasi-experimental studies). The interventions included among others: form-focused instruction (Fiori-Agoren, 2004; Kim, 2012), strategies of corrective feedback (Ellis, Loewen, & Erlam, 2006; Erlam & Loewen, 2010; Faqeih, 2012; Li, 2010), explicit instruction (Akakura, 2012; Elliot, 1997), and particular types of teaching approaches (Burger & Chretien, 2001; Trofimovich, Lightbown, Halter, & Song, 2009; Trofimovich, Lightbown, & Halter, 2013). (See Table 2 for a summary of quasi-experimental studies using EI as a measure of language learning outcome.)
Table 2. Summary of experimental studies using EI as a measure of language learning outcome, 1970–2013.
Not specified means that these studies used an institutional approach to characterizing participants’ language proficiency level.
Observational studies were classified into two types: group comparison studies (k = 30) and correlational studies (k = 15). Group comparison studies featured comparisons of EI scores by speakers of different proficiency levels. In contrast, correlational studies mainly examined the concurrent validity of EI scores as a measure of global language proficiency (e.g., Henning, 1983) by correlating EI scores with scores on other (more established) language proficiency tests such as TOEFL iBT or IELTS (e.g., Erlam, 2006).
The use of EI as a measure of L2 proficiency: A historical review
Popularity of EI in the 1970s and 1980s
The use of EI as a measure of L2 proficiency has undergone interesting shifts over the past few decades, and these shifts accompany shifts in the theoretical models of language proficiency and attendant frameworks associated with discussions of validity. Figure 2 shows the number of L2 EI studies during the period of 1970–2014. In the 1970s and 1980s when EI first appeared in L2 research, EI tasks were mainly used to address linguistic competence, mostly assessing some aspect of grammar (e.g., Markman et al., 1975; Naiman, 1974).

Figure 2. Publication year trend by document type.
However, as structural approaches to defining language proficiency were questioned, interest grew in theoretical and empirical explorations of communicative competence, or more complex and inclusive models of language proficiency (Bachman, 1990; Bachman & Palmer, 1996; Canale & Swain, 1980; Thomas, 1992). Fulcher (2000) referred to this theoretical and methodological shift as the “communicative” movement (p. 483). A concomitant move was that traditional, psycholinguistic language tasks fell out of favor among L2 researchers. An examination of the literature in L2 research reveals that, at least in the 1990s, the use of EI and similar general proficiency and psycholinguistic measures (Oller, 1973, 1976), such as dictation and cloze procedure, decreased. Instead, L2 researchers became more interested in assessments with strong face validity and tasks that appear to simulate real-life communication. Consequently, despite its former popularity in language-related research, EI was questioned as a useful measure of language proficiency (Vinther, 2002) as authenticity, interactivity, and performance moved to center stage.
Debate on the authenticity and construct-related validity of EI: An interesting shift in the 1990s
A main criticism of EI is related to authenticity. That is, language production prompted through EI tasks is criticized as unrepresentative of natural speech or conversation (e.g., Hood & Lightbown, 1978; Hood & Schieffelin, 1978). More recently, however, van Moere (2012) argued for the authenticity of elicited imitation tasks from the perspective of automaticity and formulaicity in spoken communication. When preparing or giving a response within a conversation, it is necessary that speakers draw on the language used by conversational partners and summarize, or even repeat, particular statements. It is therefore reasonable to argue that natural conversation and interaction depend in part on repetition.
A more important criticism of EI stems from the uncertainty of what EI actually measures. The available literature has not clarified the underlying construct, and research investigating the construct-related validity of EI, with the exception of Zhou (2012), is limited. The debate over and emphasis on the construct-related validity of EI reflect the change in traditional views of validity from multiple, perhaps complementary, forms of validity (e.g., content, criterion, construct-related validity) to a contemporary view of validity as a unified concept that emphasizes construct-related validity (e.g., Messick’s (1989) Unified Theory of Construct Validity).
Though not always explicitly stated, a defining characteristic of a language proficiency measure lies in its ability to assess the participants’ linguistic knowledge (i.e., their ability to process linguistic information to construct meaning). However, different views exist as to whether EI can measure one’s linguistic knowledge. Some scholars (e.g., Eisenstein, Bailey, & Madden, 1982; Naiman, 1974) advocate that EI prompts participants to process the structure and meaning of the sentences. They argue that in order to be able to repeat a sentence, one has to comprehend the meaning of the sentence. Others suspect that EI only prompts parroting (i.e., rote repetition of the chain of acoustic information without comprehension), thus only measuring the capacity of phonological short-term memory (Gathercole & Baddeley, 1993).
Resurgence of EI as a measure of implicit grammatical knowledge
Despite the criticisms of EI with respect to authenticity and construct-related validity, recent literature displays a resurgence of interest in using EI in L2 research. EI has again been used as a research tool to examine performance on a range of syntactic, morphosyntactic, lexical, and phonological structures (e.g., Akakura, 2012; Schimke, 2011; Trofimovich et al., 2009; Van Boxtel, Bongaerts, & Coppen, 2005; Verhagen, 2011; West, 2012). In addition, a growing number of studies have used EI to measure implicit grammatical knowledge as a global construct (e.g., Bowles, 2011; Ellis, 2005; Erlam, 2006; Serafini, 2013).
The increase in the number of EI studies on implicit grammatical knowledge reflects the theoretical discussions of SLA that have emerged in this era. Investigation of implicit grammatical knowledge presumes a statistical, lexical, or input-based model of language acquisition (e.g., founded on information processing theories and connectionism) and postulates that language learning is not different from other types of learning (Ellis, 2005). Statistical models of language learning place an emphasis on input frequency (i.e., exposure to language) and its effect on learners’ implicit linguistic knowledge (i.e., mental representation of linguistic knowledge and automaticity in language processing) (Ellis, 2002). That is, a learner’s ability to comprehend and produce language is largely governed by his or her internalized lexicon as well as or instead of innate syntactic rules. Learners’ mental lexicons, built on their own experiences with language input and output, store statistical information about the behavior (i.e., relative frequency, co-occurrence patterns, and functional contexts) of lexical items and syntactic structures permanently in long-term memory (LTM; Naiman, 1974). The statistical information of the lexical and syntactic structures, in turn, allows learners to predict or project words and/or chunks automatically as they process new linguistic input.
In the case of EI tasks, when learners hear a sentence, the linguistic input is first stored temporarily as acoustic images in their short-term memory (STM). Based on assumptions of the statistical models of language acquisition, learners need to rely on their implicit linguistic knowledge to decode the linguistic input of the sentence (i.e., comprehension) and retrieve matching lexical and syntactic structures to reconstruct the meaning of the sentence (i.e., repetition). In contrast, if the linguistic input in the sentence is beyond learners’ implicit grammatical knowledge, learners may have to attempt repeating the sentence through imitation of the acoustic information without decoding the sentence for meaning (i.e., parroting or rote repetition; Gathercole & Baddeley, 1993).
The repetition of acoustic images without comprehension (i.e., the repetition of meaningless strings of sounds) is more difficult than the repetition of a string that is meaningful. Because the acoustic information is stored temporarily in STM (Huitt, 2003), rote repetition is possible only if the sentences are short and/or are continuously rehearsed, but repetition will be labored or ineffective if a delay is inserted or the sentence length exceeds the capacity of STM. In contrast, imitation with comprehension requires the speaker to decode the acoustic information in the sentence, map the sounds onto the corresponding structures and meanings, and eventually convert the selected structures to sounds to reconstruct the same meaning. By doing so, even though a sentence may exceed the capacity of STM, the speaker can access (automatically) linguistic knowledge to comprehend the sentence first and then rely on the meaning of the input and implicit linguistic knowledge to aid in reconstructing the sentence. Because of the comprehension process, the speaker may paraphrase the original sentences instead of repeating them verbatim (as has been included as an independent level in the rating scales of EI, e.g., Hamayan, Saegert, & Larudee, 1977; Markman et al., 1975), but the capture, access, and transformation of meaning can be assumed to occur only when adequate language resources are available. The elicitation of paraphrasing on EI tasks, to some extent, offers supportive construct-related validity evidence that EI requires the speaker to access implicit grammatical knowledge to comprehend the sentence before repeating.
In addition, previous EI literature offers empirical evidence that EI measures implicit grammatical knowledge. For example, Ellis and his colleagues (Ellis, 2005; Ellis & Loewen, 2007) used factor analysis to examine the extent to which a battery of five tests (i.e., imitation (EI), oral narrative, timed grammatical judgment task (GJT), untimed GJT, and metalinguistic knowledge) measure implicit and explicit grammatical knowledge. EI was conceptually classified as a measure of implicit grammatical knowledge owing to four features: (1) respond according to feel; (2) respond under time pressure; (3) focus on meaning; and (4) require no metalinguistic knowledge (p. 157). Results of their studies indicated that EI measures implicit grammatical knowledge. Bowles (2011) also supports their argument with evidence from a confirmatory factor analysis in a replication study. In addition, in Zhou’s (2012) review, strong correlations between EI scores and scores on oral narrative tasks ranging from .48 to .87 (p. 91) add supportive evidence linking EI performances to the acquisition and use of implicit grammatical knowledge.
Although it is impossible to directly observe how linguistic information is processed, the question regarding what EI measures can also be informed through differential-population studies (Popham, 2008), for example, those using EI tasks to discriminate between individuals across proficiency levels. Speakers of higher proficiency level tend to internalize a wider range of lexical items and syntactic structures and thus can be assumed to be more automatic and capable at accessing their implicit knowledge to repeat the sentence and less dependent on rote repetition. Similarly, higher proficiency speakers should be more able to repeat longer and linguistically more complex sentences. In contrast, when it comes to lower proficiency speakers without sufficient internalized grammatical structures, the facilitation from implicit grammatical knowledge tends to be restricted. As a result, in order to repeat sentences, they are more likely to rely on rote repetition and thereby fail on the tasks.
If EI tasks only elicit rote repetition, EI may only measure the capacity of phonological STM. Then, instead of processing the meaning of the sentences, participants can rely on their STM to recall and imitate the chain of sounds and therefore should perform indistinguishably on EI tasks regardless of their language proficiency level. That said, conceptually, the distinction between repetition with comprehension and rote repetition can be regarded as two extremes on a continuum. That is, even higher proficiency speakers may occasionally parrot sentences or parts of a sentence. However, because of the facilitation of their implicit grammatical knowledge, their reliance on rote repetition is arguably less heavy or frequent than that of lower proficiency speakers.
Four key task features that may affect the construct-related validity of EI
Along with the resurgence of EI literature in L2 research, researchers have also placed an emphasis on the design of EI tasks in relation to the sensitivity of EI. Literature on STM and EI suggests that rote repetition can happen but only under certain conditions: (1) sentences are short (Munnich, Flynn, & Martohardjono, 1994); (2) repetition takes place immediately after the stimulus (McDade, Simpson, & Lamb, 1982); and (3) imitation is continually rehearsed without interruption (Gathercole & Baddeley, 1993). If EI tasks are used to measure language proficiency, these task features must be incorporated in a principled manner so that the tasks can be more effective at prompting imitation with comprehension rather than rote repetition.
Vinther (2002), building on previous reviews of EI (e.g., Bley-Vroman & Chaudron, 1994; Gallimore & Tharp, 1981), suggested four key task features that influence the validity of EI tasks: (a) length of sentence stimuli, (b) delayed repetition, (c) grammatical features of the stimuli, and (d) scoring methods. Control of these variables is likely to increase the sensitivity of EI to discriminate between learners on the measured constructs (i.e., language proficiency) and reduce construct-irrelevant variance (i.e., STM capacity).
Length of sentence stimuli
Sentence length has been frequently observed as a factor that influences the difficulty of EI tasks (e.g., Miller, 1973; Perkins, Brutten, & Angelis, 1986). In order to measure language comprehension (i.e., to minimize the effect of working memory), the length of sentence stimuli must exceed the learners’ STM capacity. However, L2 researchers have not agreed on cutoffs for an appropriate sentence length – the length that would best distinguish between learners at different levels. Regarding the limit of STM, Miller’s Law (1956) states that the number of chunks (be they syllables, numbers, words, or sequences) that one can hold in STM is 7 ± 2. The “magic number seven” coincides with Perkins et al. (1986), who suggest that the length of the sentences be set at seven to eight syllables. However, Naiman (1974) chose sentences of 15 syllables for first- and second-grade L2 learners and considered the length to be appropriate. This choice was also selected in the assessment for adult learners conducted by Eisenstein et al. (1982). Jensen and Vinther (2003) chose even longer sentences, the majority of which exceeded 16 syllables and found that most native speakers were capable of repeating the sentences.
In this review, we selected the following cut-offs to break down sentences into three length bands: short (< 8 syllables), medium (8–15 syllables), and long (> 15 syllables) (see Table 1). Overall, 22 out of 58 studies used stimuli sentences of medium length while 21 studies used stimuli sentences of varying lengths (i.e., sentences across two or even three length bands).
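The length bands above translate directly into a simple classification rule. The sketch below applies the stated cut-offs to a syllable count and labels a task as "varying" when its stimuli span more than one band; the function names and the example counts are hypothetical, and syllable counting itself is assumed to have been done beforehand.

```python
def length_band(syllables: int) -> str:
    """Map a stimulus sentence's syllable count to the review's length bands."""
    if syllables < 1:
        raise ValueError("syllable count must be positive")
    if syllables < 8:
        return "short"
    if syllables <= 15:
        return "medium"
    return "long"

def task_band(syllable_counts) -> str:
    """Band for a whole EI task; 'varying' if stimuli cross band boundaries."""
    bands = {length_band(n) for n in syllable_counts}
    if not bands:
        raise ValueError("task has no stimuli")
    return bands.pop() if len(bands) == 1 else "varying"

print(task_band([9, 12, 15]))   # all stimuli fall in one band
print(task_band([6, 12, 17]))   # stimuli span several bands
```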
Repetition delay
The insertion of delay often takes the form of a period of silence (usually three to five seconds) or an interruptive task (e.g., answering a cognitively unchallenging question) before repetition. As discussed earlier, repetition of sentences without comprehension is possible if the learner continuously rehearses the chain of sounds before repetition; therefore, the insertion of delay should interrupt continual rehearsal. However, as Vinther (2002) argues, the insertion of delay may also interfere with the processing of the structure and meaning of the sentences, especially when the sentences are long. As Table 1 shows, only 21 out of 58 studies specified implementation of delayed repetition in their EI tasks.
Grammatical features of the sentence stimuli
The difficulty of EI tasks has been shown to be influenced by linguistic features of the sentence stimuli, including among others: syntactic complexity (Ortega, 2000), lexical difficulty (Graham, McGhee, & Millard, 2010), phonological structure of the words in the sentence (Menyuk, 1971), and the use of ungrammatical sentences (Erlam, 2006). As shown in Table 1, it appears that the most common ways of controlling linguistic features of the sentence stimuli are the control of the syntactic and morphosyntactic features of the sentence (k = 36) and the use of ungrammatical sentences (k = 22).
The relationship between features of syntactic, lexical, and phonological complexity and the resultant difficulty of EI tasks can be frequently observed across studies (Graham et al., 2010; Menyuk, 1971; Perkins et al., 1986; Ortega, 2000). However, the use of ungrammatical sentences is much debated. The rationale for using ungrammatical sentences is that these sentences naturally elicit automatic correction of grammatical errors especially when the sentence length exceeds the capacity of STM. Hamayan et al. (1977) argued that failure to correct grammatical errors is evidence of inadequate implicit knowledge of the target structures. However, error correction does not necessarily occur even among native speakers (Markman et al., 1975), especially when the instructions do not require subjects to do so, which poses questions on the usefulness of ungrammatical sentences in measuring the target linguistic structures.
Scoring method
The three most common approaches to scoring EI responses are the binary yes–no approach (k = 24), the ordinal rating scale approach (k = 15), and the interval scale approach (e.g., number or percentage of errors, or automated measures of prosodic features, k = 15). The yes–no approach gives only two possible scores for each EI response: 1 for correct repetition and 0 for incorrect repetition (e.g., Ellis, 2005; Erlam, 2006). The rating scale approach establishes a rating rubric, usually with more than three score levels, to quantify the accuracy of repetition (e.g., Markman et al., 1975). The interval scale approach often utilizes the error rate of particular grammatical features (e.g., West, 2012) and automated scoring tools for more complicated linguistic analysis (e.g., Lonsdale & Christensen, 2011; Trofimovich & Baker, 2007). It is reasonable to speculate that the choice of scoring method may influence the reliability of EI and its ability to distinguish between speakers across proficiency levels. However, the impact of the choice of scoring method remains less investigated than other task features in the literature.
Summary of Phase I
The narrative review shows that EI has been widely used in L2 research; however, the constructs argued to be measured by EI have undergone interesting shifts over time, with more studies focusing on specific linguistic structures and implicit grammatical knowledge. In addition, EI has been used as an outcome measure for the effectiveness of certain treatments. The resurgence of EI studies in the literature indicates that EI has regained attention from L2 researchers as a potentially useful tool to measure L2 proficiency. Nevertheless, our survey of the extant empirical EI studies indicates a great degree of variation in the design of four key EI task features, all of which are associated with the construct-related validity of EI. The extent to which variation in the design of EI tasks has an impact on the quality of the measurement requires further investigation.
Phase II: Meta-analytic investigation
Findings from the narrative review form the theoretical basis for the quantitative meta-analysis in Phase II, which helps clarify the construct measured by EI and further informs whether variation in the design of key task features has an impact on the sensitivity of EI. The meta-analysis presented in this phase specifically addresses two questions: (1) whether (and to what extent) EI tasks can measure the proficiency of speakers; and (2) whether (and to what extent) the sensitivity of EI differs across designs of the task features discussed in Phase I.
Methods
Study selection criteria
We included two types of studies in Phase II to investigate the construct-related validity of EI as a measure of L2 proficiency: (1) studies that compared EI scores across groups of different proficiency levels (Type I); and (2) studies that examined the relationship between scores on EI and other measures of language proficiency (Type II) (see Figure 1). The two types of studies offer evidence for the construct-related validity of EI from different perspectives. In particular, Type I studies provide evidence for whether EI tasks can discriminate between speakers with different language proficiency levels, while Type II studies provide evidence for whether EI taps a construct similar to that measured by other well-known language proficiency tests.
In addition to the selection criteria for Phase I, studies to be included in a meta-analysis must fit two additional conditions: (1) the study has at least two groups of participants at two different proficiency levels (e.g., advanced vs. intermediate) to be compared quantitatively (Type I) or the study reports a correlation between EI task scores and scores on another language proficiency test (Type II); and (2) the researchers should report means, standard deviations, sample sizes and/or other statistical results (e.g., t-statistic, Pearson’s r, chi-square statistic, or F-statistic with a degree of freedom of 1) that are required for retrieving an effect size.
Identification of studies
Among the 76 studies included in the narrative review, 30 studies were Type I studies. However, only 10 of the 30 studies met the additional criteria for meta-analysis. The other 20 studies did not report sufficient statistics to enable us to compute standardized mean differences and were therefore excluded from the meta-analysis. Efforts were also made to request the statistical information necessary to compute effect sizes by contacting primary authors; however, the attempts were not successful. Similarly, of the 28 Type II studies, we were able to retrieve Pearson r correlations from 11 studies. The process identified a total of 21 studies that met the selection criteria, including 13 published journal articles and eight unpublished doctoral dissertations or master’s theses (see Figure 1).
Data extraction
From the 10 Type I studies, we were able to extract 24 stochastically dependent and 13 independent effect sizes (representing 498 cases). Note that we consider an effect size dependent when the same dyad of higher and lower proficiency groups provides more than one effect size, and independent when we retrieve only one effect size from the dyad. The handling of statistically dependent effect sizes in the data analyses is discussed later.
The effect sizes obtained from Type I studies take the form of standardized mean differences that indicate the differences in mean EI scores between lower proficiency speakers and higher proficiency speakers (e.g., native or advanced L2 speakers). For Type II effect sizes, which are Pearson r correlation coefficients, we retrieved 44 dependent and 11 independent effect sizes (representing 591 cases) from the 11 Type II studies; these indicate the extent to which EI scores are correlated with scores on other established measures of L2 proficiency.
For data analyses, both types of effect sizes were then converted to a common estimator called Hedges’ g (Hedges, 1981). Hedges’ g is expressed in the metric of the standardized mean difference in EI test scores between higher and lower proficiency groups and is a less biased estimator than Cohen’s d (1988) for primary studies with small sample sizes.
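The conversion described above can be sketched in a few lines. This is a minimal illustration, not the procedure implemented in the software used for the analyses: it assumes the textbook formulas in Borenstein et al. (2009), namely a pooled-SD standardized mean difference, the approximate small-sample correction J = 1 − 3/(4df − 1), and the r-to-d conversion d = 2r/√(1 − r²). The function names and the choice of df = n − 2 for correlation-based effect sizes are assumptions made for this example.

```python
import math


def hedges_g(mean_hi, sd_hi, n_hi, mean_lo, sd_lo, n_lo):
    """Hedges' g and its variance for two independent groups (Type I design)."""
    df = n_hi + n_lo - 2
    # Pooled within-group standard deviation
    sd_pooled = math.sqrt(((n_hi - 1) * sd_hi ** 2 + (n_lo - 1) * sd_lo ** 2) / df)
    d = (mean_hi - mean_lo) / sd_pooled
    var_d = (n_hi + n_lo) / (n_hi * n_lo) + d ** 2 / (2 * (n_hi + n_lo))
    # Approximate small-sample correction factor
    j = 1 - 3 / (4 * df - 1)
    return j * d, j ** 2 * var_d


def r_to_g(r, n):
    """Convert a Pearson r (Type II design) to Hedges' g and its variance."""
    d = 2 * r / math.sqrt(1 - r ** 2)
    var_r = (1 - r ** 2) ** 2 / (n - 1)
    var_d = 4 * var_r / (1 - r ** 2) ** 3
    # Assumption: df = n - 2 for the correction factor
    j = 1 - 3 / (4 * (n - 2) - 1)
    return j * d, j ** 2 * var_d
```

Both functions return the effect size together with its sampling variance, because the inverse of that variance is what serves as the study weight when effect sizes are pooled.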
While careful judgement is needed to transform different indices of effect size into a common metric (Aloe & Shisler, 2015; Borenstein, Hedges, Higgins, & Rothstein, 2009), we consider combining two types of effect size (i.e., standardized mean differences from Type I studies and correlation coefficients from Type II studies) appropriate in our meta-analysis, on the assumption that the two effect size indices represent a common construct (i.e., the sensitivity of EI) in relevant ways, but with different research designs. However, we acknowledge that by converting correlation coefficients into standardized mean differences, we are creating a dichotomized representation of language proficiency (a continuous variable), and vice versa. Therefore, owing to the slightly different representations of the construct by the two types of effect sizes, we performed moderator analyses to examine whether the two types of study led to any significant differences in the variance of effect sizes as a sensitivity analysis (see Data analysis in the following section).
A conventional interpretation of a positive Hedges’ g is that higher proficiency speakers tend to score higher on EI tasks than lower proficiency speakers, which supports the construct-related validity of EI as a measure of L2 proficiency; on the other hand, a negative or close-to-zero effect size means that higher proficiency speakers perform similarly to lower proficiency speakers on EI tasks, suggesting that EI is not a sensitive or valid measure to distinguish speakers across different levels of L2 proficiency.
Data analysis
We used a random-effects model as the theoretical framework for combining effect sizes because this meta-analysis may only represent a sample of all the studies that compare performance on EI tasks across L2 proficiency levels (Hedges & Vevea, 1998). The Q test (Hedges & Olkin, 1985) was conducted to evaluate the homogeneity of retrieved effect sizes. An alpha level of .05 was set for statistical significance. In addition, an I² statistic, which indicates the ratio of the true heterogeneity (between-study variance) to the total variance across the observed effect estimates (Higgins, Thompson, Deeks, & Altman, 2003), was calculated to quantify the amount of variation in the effect sizes due to the differences between studies. Weights, calculated by taking the inverse of the variance of each effect size, were used to reflect the precision of the estimated effect sizes retrieved from each study. We used Comprehensive Meta-Analysis software (Version 2; Borenstein, Hedges, Higgins, & Rothstein, 2007) to run all the statistical analyses involved in the meta-analysis.
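The quantities named in this paragraph (inverse-variance weights, Q, I², and the random-effects average) can be illustrated with a short sketch. It assumes the DerSimonian–Laird method-of-moments estimator for the between-study variance; the text does not state which estimator was used, so this choice, like the function name `random_effects`, is an assumption for illustration only. The p-value for Q, which requires a chi-square distribution with k − 1 degrees of freedom, is left out to keep the example free of external dependencies.

```python
def random_effects(effects, variances):
    """Pool effect sizes under a random-effects model (DerSimonian-Laird)."""
    k = len(effects)
    # Fixed-effect (inverse-variance) weights reflect each study's precision
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    # Q: weighted sum of squared deviations from the fixed-effect mean
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))
    df = k - 1
    # Method-of-moments estimate of between-study variance, truncated at 0
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # I-squared: share of total variation attributed to between-study differences
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add tau2 to each study's sampling variance
    w_re = [1 / (v + tau2) for v in variances]
    mean_re = sum(wi * g for wi, g in zip(w_re, effects)) / sum(w_re)
    se_re = (1 / sum(w_re)) ** 0.5
    return {"mean": mean_re, "se": se_re, "Q": q, "df": df, "tau2": tau2, "I2": i2}
```

Called with a list of Hedges’ g values and their variances, the function returns the weighted average, its standard error, and the heterogeneity statistics in one dictionary.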
Handling multiple effect sizes
Multiple effect sizes obtained from the same study may violate the statistical assumption of independence for inferential analyses in meta-analysis. Multiple effect sizes were retrieved from eight of the 10 Type I studies, that is, all except Zhou (2012) and Iwashita (2009). Recall that effect sizes obtained from different comparison groups were considered independent (e.g., Li, 2010; Wu & Ortega, 2013); others obtained from the comparison of the same two groups were considered statistically dependent (e.g., Bowles, 2011; Erlam, 2006; Flynn, 1986; West, 2012), which requires some adjustments to avoid statistical violations and to ensure the validity of our analyses. Hence, we established the following criteria to resolve the issue.
In the studies (e.g., Bowles, 2011; Erlam, 2006; Flynn, 1986; Serafini, 2013; Yoon, 2010) that reported both total scores and subsection scores, the total scores were selected for computing effect size from the study.
In Type I studies (e.g., West, 2012) that used multiple comparable interval variables to score EI, effect sizes for individual variables were averaged. Likewise, in Type II studies that reported correlations between EI and multiple measures of language proficiency, all effect sizes were averaged.
In Type I studies (e.g., Bowles, 2011; West, 2012) that included three proficiency level groups, the effect size for the two adjacent levels that had the smaller mean score difference was selected. Although the selection of the smaller effect sizes might underestimate the magnitude of the effect sizes (i.e., the discrimination associated with EI), we chose to be conservative and consistent as we were interested in examining the ability of EI tasks to make relatively fine distinctions between adjacent proficiency levels.
In the quasi-experimental Type I studies (e.g., Flynn, 1986; Serafini, 2013), only comparison on pretest scores was selected in order to avoid an intervention effect.
In the studies (e.g., Li, 2010; Wu & Ortega, 2013) that included multiple independent groups for each proficiency level, the effect sizes for all independent group comparisons were used instead of the effect sizes for the combined total group comparison.
This process enabled us to retrieve 120 dependent and 24 independent effect sizes from the 21 studies.
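The averaging rule among the criteria above can be sketched as follows. This is a simplified illustration that only covers the case in which several dependent effect sizes from the same pair of groups are replaced by their unweighted mean; the record layout `(study, comparison, g)` and the function name `collapse_dependent` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from statistics import mean


def collapse_dependent(records):
    """Average effect sizes that come from the same study and group comparison.

    records: iterable of (study, comparison, g) tuples.
    Returns a dict mapping (study, comparison) to one averaged effect size,
    so each comparison contributes a single independent value.
    """
    grouped = defaultdict(list)
    for study, comparison, g in records:
        grouped[(study, comparison)].append(g)
    return {key: mean(values) for key, values in grouped.items()}
```

Effect sizes from different group comparisons within one study keep separate keys and therefore remain separate, in line with the treatment of independent comparisons described above.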
Moderator analyses
Moderator analyses were performed to investigate how the design features of EI tasks may relate to the sensitivity of EI in distinguishing different proficiency groups. We grouped studies by the design of different task features that we reviewed in Phase I and computed the weighted effect sizes across different designs of EI task features. The moderators we selected from Phase I included the following: (1) the length of sentence stimuli; (2) the use of ungrammatical sentence stimuli; (3) the insertion of delay; and (4) the scoring method. In addition, we examined the sensitivity of EI with respect to the nature of construct, by comparing weighted average effect sizes between studies that use EI to measure global language proficiency and studies that target specific linguistic structures.
Results
The forest plot of the 24 independent Hedges’ g effect sizes with their 95% confidence intervals is shown in Figure 3. The mid-point of each line represents the point estimate of the effect size for each study. The overall weighted average of the 24 effect sizes was 1.34, with a standard deviation of 0.13, indicating that, on average, scores on EI differ by 1.34 standard deviations between lower and higher proficiency speakers. That is, despite the fact that the primary studies used an impressionistic approach to classify the speakers and that these studies sampled different ranges of language proficiency, EI scores can consistently distinguish performance of higher and lower proficiency speakers within the particular range of language proficiency sampled in each study. Following either Cohen’s (1988) guideline for interpreting effect sizes or a more discipline-specific guideline proposed by Plonsky and Oswald (2014), the weighted average effect size can be regarded as very large, indicating that EI is a highly sensitive measure. This adds supportive construct-related validity evidence to EI as a measure of L2 proficiency.

Hedges’ g effect sizes with 95% confidence intervals.
In addition, the forest plot shows that the magnitude of the effect size varies across studies. The test of homogeneity for effect sizes indicates that the effect sizes varied significantly across studies, Q(23) = 128.72, p < .001. The estimated between-study variance of effect sizes, τ², was 0.31, which suggests a large variation in the effect sizes. The I² statistic was 82.13%, which indicates that a large proportion of variation in effect sizes was due to the differences across individual studies. In summary, these statistical results indicate that while, on average, EI is an effective measure of L2 proficiency, the sensitivity of EI differs to a great extent across studies, possibly depending on the way EI tasks were designed and implemented. Because of the variation in effect sizes, we conducted moderator analyses to identify sources of variation in the effect sizes.
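As a check on how the heterogeneity statistics relate to each other, the reported I² value can be recovered from the reported Q statistic and its degrees of freedom, assuming the standard definition in Higgins et al. (2003):

```latex
I^2 = \frac{Q - df}{Q} \times 100\%
    = \frac{128.72 - 23}{128.72} \times 100\%
    \approx 82.13\%
```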
Research design as a moderator for the magnitude of effect sizes
As previously mentioned, the meta-analysis synthesized two types of study that might provide construct-related validity evidence of EI from different approaches. Thus, we first examined whether the type of research design used in the primary studies (i.e., Type I and II studies) makes a difference in the magnitude of effect sizes. Table 3 reports a summary of the group-comparison studies. The weighted average effect size for the 13 effect sizes was 1.38, with a standard deviation of 0.22. Similar to the overall effect, the large effect size suggests that EI tasks can effectively distinguish between speakers of different proficiency levels. The fact that higher proficiency speakers performed consistently better on EI tasks than did lower proficiency speakers across studies implies that EI measures language proficiency. More specifically, higher proficiency speakers were more capable of repeating the sentences than were lower proficiency speakers in all studies. Therefore, it is more likely that, in order to repeat sentences, the speaker has to rely on his or her internalized linguistic knowledge to decode the structural information of the sentence and then reconstruct the meaning of the sentence. In other words, parroting or rote repetition of a chain of sounds alone does not allow for successful completion of EI tasks. However, it should be noted that, although EI was consistently sensitive across studies, the majority of the 10 studies (13 effect sizes) used EI to discriminate high-proficiency speakers from intermediate- or low-proficiency speakers; only two studies (Iwashita, 2009; Serafini, 2013) used EI tasks to distinguish between intermediate- and low-level speakers. Thus, caution is needed when generalizing this finding to the ability of EI to distinguish speakers of all proficiency levels, especially among lower proficiency speakers.
Summary of studies included in meta-analysis (reporting group mean differences).
S = short, M = medium, L = long, V = varied; Y = yes, N = no; Gr = use of ungrammatical sentences, Lg = sentence length, Dl = delayed imitation; PL = proficiency level, HG = high, HI = high intermediate, IM = intermediate, LW = low, NS = Native.
Table 4 summarizes studies that reported correlations between EI scores and scores on other measures of language proficiency. The weighted average Hedges’ g for the 11 effect sizes was 1.31 (equivalent to an average correlation coefficient of .59), with a standard deviation of 0.18. This suggests that the performance of L2 speakers on EI tasks was strongly correlated with their performance on other L2 proficiency measures. This large effect size offers similar construct-related validity evidence, supporting EI as a sensitive measure of L2 proficiency. A Q test examining the homogeneity of average effect sizes did not yield a significant result, Q(1) = 0.44, p = .51, suggesting that type of study does not make a difference in the magnitude of effect sizes. Therefore, we combined all the effect sizes from the two types of study in the subsequent moderator analyses.
Summary of studies included in meta-analysis (reporting correlations between scores on EI and other measures).
M = medium, V = varied; Y = yes, N = no; Gr = use of ungrammatical sentences, Lg = sentence length, Dl = delayed imitation.
Task features as moderators for the sensitivity of EI
Table 5 summarizes the results of the moderator analyses of EI studies, grouped by the design of key task features. Overall, three task features showed a significant moderating effect on the sensitivity of EI: construct, sentence length, and scoring method. It is important to note that none of the confidence intervals overlapped with 0, indicating that, despite the variation in task design, EI in general is sensitive to differences in proficiency level. However, different designs of certain task features appeared to lead to differential sensitivity of EI.
Comparisons of the sensitivity of EI expressed by Hedges’ g across designs of certain task features.
First, nature of construct was identified as a significant moderator. EI is more sensitive when used to measure global constructs (e.g., general language proficiency, implicit grammatical knowledge) (k = 9, g = 1.59, SE = 0.08) than specific (syntactic, morphosyntactic, phonological) linguistic constructs (k = 12, g = 1.00, SE = 0.07), Q(1) = 32.41, p < .001. This suggests that performance on EI tasks entails processes that reflect a speaker’s general language proficiency more than discrete grammatical knowledge or skills.
The second moderator was sentence length. The studies that employed sentences of varying length (k = 12, g = 1.51, SE = 0.07) had a significantly larger effect size than did studies that did not (k = 10, g = 0.96, SE = 0.10), Q(1) = 20.66, p < .001. This suggests that variation in the length of EI sentence stimuli leads to heightened sensitivity in EI when used to measure different L2 proficiency levels. Third, among the three commonly used scoring methods, the ordinal rating scale (k = 8, g = 1.61, SE = 0.08) was more capable of distinguishing speakers across proficiency levels than the binary scale (k = 7, g = 1.16, SE = 0.08) and the interval scale or mixed scoring method (k = 5, g = 0.77, SE = 0.14), Q(2) = 33.22, p < .001.
Finally, as also reported in Table 5, use of ungrammatical sentences and insertion of delay were not identified as moderators for the sensitivity of EI. The use of ungrammatical sentences did not lead to a heightened sensitivity of EI, Q(1) = 2.87, p = .09. Regardless of the use of ungrammatical sentences, EI tasks consistently distinguish higher and lower proficiency speakers. It is interesting to note that EI tasks that used ungrammatical sentences (k = 9, g = 1.16, SE = 0.09) appeared to be less discriminating than EI tasks that did not (k = 15, g = 1.34, SE = 0.07). Similarly, insertion of delay did not have a significant impact on the sensitivity of EI tasks, Q(1) = 0.31, p = .58. EI tasks that adopted delayed imitation (k = 13, g = 1.25, SE = 0.07) appeared to be slightly less discriminating than EI tasks that did not (k = 11, g = 1.30, SE = 0.08); however, the 95% confidence intervals for the two groups largely overlapped, suggesting that the insertion of delay does not necessarily add much variation to the sensitivity of EI scores. These two task features will be further discussed in the following section.
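The subgroup comparisons reported in this section rest on a between-groups Q statistic, which can be sketched as below. This is a minimal illustration that takes the subgroup mean effect sizes and their standard errors as given and weights each subgroup by the inverse of its squared standard error; it is not a reimplementation of the software used for the analyses, and the function name `q_between` is an assumption for this example.

```python
def q_between(means, ses):
    """Between-groups Q for subgroup mean effect sizes and their standard errors.

    Under the null hypothesis of equal subgroup means, the result is compared
    against a chi-square distribution with len(means) - 1 degrees of freedom.
    """
    # Each subgroup mean is weighted by the inverse of its squared standard error
    weights = [1 / se ** 2 for se in ses]
    grand_mean = sum(w * m for w, m in zip(weights, means)) / sum(weights)
    return sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means))


# Construct moderator, using the rounded values reported in the text:
# global (g = 1.59, SE = 0.08) vs. specific (g = 1.00, SE = 0.07)
q_construct = q_between([1.59, 1.00], [0.08, 0.07])  # about 30.8
```

With the rounded subgroup values, the construct comparison comes out at about 30.8, close to but not identical with the reported Q(1) = 32.41; the gap presumably reflects rounding of the published means and standard errors or details of the software's computation.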
Additionally, we examined the potential impact of publication bias on the validity of our statistical conclusion. Figure 4 shows a funnel plot with 24 independent effect sizes. The analysis indicated that studies with small sample sizes are absent in our pool of primary studies; however, overall, the funnel plot looks symmetrical. Because studies with larger sample sizes provide precise estimates and are more likely to be published and included in a meta-analysis, our result might overestimate the overall sensitivity of EI. However, these missing studies with small sample sizes tend to carry small weight and thus have less impact on the present meta-analytic results. In addition, because the meta-analysis showed strong support for the construct-related validity of EI, it is unlikely that this finding would be reversed by additional studies with small sample sizes.

Funnel plot for effect sizes extracted from EI studies.
Summary for Phase II
Findings of the meta-analysis suggest that, in general, EI is a sensitive measure to distinguish between speakers across proficiency levels. In terms of EI task design, we found no principled or systematic ways of developing and implementing EI tasks across studies, with great variability in how EI tasks have been designed and administered. Nevertheless, moderator analyses of the effect sizes by EI design features suggest that the ability of EI tasks to differentiate speakers across proficiency levels can be strengthened by the manipulation of certain task features. This implies the importance of principled EI task development for increasing the sensitivity of the instrument.
Overall discussion and implications
In this two-phase systematic review, we examined the use of EI, in an effort to clarify the construct measured by EI, and, more importantly, to advance discussions toward a more principled practice of EI in L2 research. Therefore, we further discuss the implications of these findings below from two perspectives: usefulness of EI as a measure of L2 proficiency and the design of certain key EI task features.
Usefulness of EI as a measure in classroom and standardized assessment contexts
Results of this study support the idea that EI is an effective measure of global language proficiency, specific linguistic structures, and the effectiveness of instructional interventions. The simple and economical administration procedures and the flexibility in the design of task features make EI an attractive candidate for a quick and effective measure of language-related constructs in both classroom and standardized assessment contexts.
To understand better the usefulness of EI tasks in both classroom and standardized assessment contexts, future research examining the connection between psycholinguistic measures and performance-based measures is imperative in order to connect the two main approaches to language testing. While it is argued that EI and similar psycholinguistic measurements lack authenticity (Bachman, 1990; Morrow, 1979), they usually outperform interactive, performance-based tasks in terms of score reliability (Bernstein, van Moere, & Cheng, 2010; van Moere, 2012). In addition, EI tasks can facilitate classroom assessment in language classes owing to their simplicity in administration and reliability in scoring (as demonstrated in Zhou, 2012). Employing a combination of psycholinguistic and performance-based measures can compensate for the limits of each type of measure, thus optimizing the usefulness of multiple measures of the same construct (van Moere, 2012). However, future research should move beyond simply examining correlations of holistic scores on those measurements toward analyzing the alignment of specific linguistic or non-linguistic features of the tasks crucial to communication.
Design of certain EI task features
Previous reviews of EI (e.g., Vinther, 2002) pointed out a number of task features that may affect the construct-related validity of EI as a measure of language proficiency. However, the findings of this study suggest that the manipulation of three task features (i.e., nature of construct, sentence length, and scoring method) distinguishes EI performances across proficiency levels better. First, EI is more sensitive when used to measure global constructs than discrete grammatical knowledge or skills. Second, EI tasks using sentences of varying length are more discriminating than EI tasks using sentences of fixed length. It is interesting to note that most studies that had varying sentence length had an average length of about 15 syllables. The average sentence length of these EI tasks was comparable to that of EI tasks with fixed length, the majority of which were of medium length (see Phase I). The higher sensitivity of EI with varying sentence length possibly results from the fact that sentence length tends to be positively correlated with the difficulty of EI tasks (e.g., Miller, 1973; Perkins et al., 1986). In addition, results of the meta-analysis in this study suggest that higher proficiency speakers tend to be more capable of sentence repetition than are lower proficiency speakers even when it comes to longer sentences (e.g., Serafini, 2013). Thus, EI tasks with varied sentence length will more likely match the ability of speakers with different proficiency levels. Third, EI discriminates better when a more refined rating scale is used. This suggests that EI can elicit a varying degree of repetition accuracy from L2 speakers, offering potential for EI to measure performances across a wide range of proficiency levels.
It is interesting to note that the insertion of delay and use of ungrammatical sentences, though frequently recommended and applied in EI task development, did not show a significant moderating effect on the sensitivity of EI. However, caution should be applied when interpreting the absence of a significant moderating effect associated with these two task features. First, a number of studies were not coded as employing delayed imitation because they did not specify the insertion of delay in the article. It is possible that some of these studies applied delay of some sort but did not report it. Therefore, the impact of the insertion of delay is better examined in experimental settings in future research.
Regarding the use of ungrammatical sentences, Hamayan et al. (1977) argued that ungrammatical sentence stimuli can elicit correction of grammatical errors and that higher proficiency speakers tend to be more able to correct grammatical errors automatically than are lower proficiency speakers. Their argument supports the statistical approach to language comprehension and production (Ellis, 2002). Once the sentence is decoded for meaning, instead of retaining the original ungrammatical structure, the speaker reconstructs the meaning of the sentence based on the frequency patterns of relevant lexical and syntactic structures in his or her implicit grammatical knowledge and automatically corrects the grammatical errors in the repetition. However, on EI tasks, the automatic error correction can be disrupted by the administration procedures. As most EI tasks with ungrammatical sentences require participants to repeat the sentences verbatim without mentioning the grammatical errors in the sentence stimuli, it remains unknown whether failure to correct grammatical errors in the stimulus sentence is a result of examinees simply following the instructions. Although the purpose of using ungrammatical sentences is to elicit error correction through implicit grammatical knowledge, not all EI tasks clearly instruct the participants to correct errors in the sentence, and, therefore, error correction is not guaranteed (see, e.g., Markman et al., 1975).
Recommendations for future research and the use of EI tasks
Based on our findings, we recommend that future studies using EI as a measure of L2 proficiency take the impact of the aforementioned three task features into account when designing effective EI tasks. In addition, further research efforts should be made to investigate how the design of key EI task features functions under specific assessment purposes and contexts. For example, future investigations might focus on identifying the optimal sentence length used in EI tasks in relation to the target proficiency levels. Jessop et al. (2007) have pointed out the lack of consensus on the appropriate length that can best measure participants’ language proficiency; however, an important factor to consider when choosing the sentence length, though less explicitly articulated in the literature, is learners’ language proficiency level. That is, higher proficiency speakers tend to repeat longer sentences as compared with lower proficiency speakers. From a measurement perspective, the more the task difficulty matches the target proficiency level, the more reliable the scores are and thus the more valid the judgment about the proficiency level of the learner can be. As discussed previously, using a range of sentence lengths is likely to vary the difficulty level of the EI tasks, which increases the potential of EI tasks to target multiple proficiency levels. Yet, the present study suggests that the selection of the range of sentence length in relation to the appropriate difficulty levels or proficiency levels remains underexplored.
In addition, future research examining the impact of the administration and scoring procedures for the imitation of ungrammatical sentences on the sensitivity of EI would be beneficial. Investigations in search of appropriate instructions will help examine whether (and the extent to which) EI can be used to measure automatic error correction. However, researchers should be reminded that the administration procedures may also create construct-irrelevant score variance (Kaplan, 1996; Munnich et al., 1994). To reduce construct-irrelevant variance, one should be careful about directing too much of the participants’ attention to the grammatical errors, as the nature of the target construct may change if error correction becomes less automatic.
Finally, to facilitate better the systematic review of the sensitivity of EI, we strongly recommend that future researchers follow the systematic reporting practices of empirical and statistical results. As Plonsky (2013) argues, the reporting of descriptive statistics, including sample sizes, means and standard deviations, “avails primary data to would-be meta-analysts who require such data to calculate an effect size” (p. 671). Missing or insufficient empirical information not only discounts the generalizability of findings but also inhibits readers from achieving a better understanding and assessment of the results and findings of the primary studies.
Overall, we conclude that EI tasks have the potential to distinguish effectively and reliably performances across proficiency levels. The results of our effort provide supportive construct-related validity evidence for EI as a measure of L2 proficiency and contribute to continued and extended discussion of EI towards a more principled practice for the development of EI tasks in L2 research.
Kuhn (1962) in his discussion of the structure of scientific revolutions has argued that changes in paradigms are characterized by one (e.g., structural) being replaced by another (e.g., communicative) followed by a reassessment in which the interests of both realign. Perhaps the current interest in EI, disfavored for several decades, represents the beginning of a realignment in which the usefulness of EI can be reassessed within the broader research context emphasizing communicative activities and interaction. EI, as well as other psycholinguistic measures, can be employed in combination with, instead of replaced by, communicative language tasks, to ultimately enhance the quality of L2 teaching and assessment.
Footnotes
Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments and valuable suggestions on the earlier versions of this paper. Any remaining errors in this publication are ours alone.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
