Abstract
Research on second language (L2) grammar in task-based language learning has yielded inconsistent results regarding the effects of task complexity, prompting calls for more nuanced analyses of L2 development and task performance. The present cross-sectional study contributes to this discussion by comparing the performance of 245 learners of German at two universities in the USA on two types of assessment tasks using multidimensional analyses of grammatical accuracy, fluency and complexity. Results show that: (1) grammatical accuracy in learner performance did not improve linearly across two years of instruction in either task condition; (2) participants tended to perform more accurately on the integrative task than on discrete-point items; (3) second-year learners wrote more fluently than first-year learners; and (4) fourth-semester learners wrote more complex sentences than other groups. The results yield important research, pedagogical and curricular insights.
I Introduction
There is a concrete pedagogical need to determine what can be expected of students in an instructed setting. Swaffar and Arens (2005) propose that ‘departments must establish standards for outcomes that reflect not mastery but attainable, reasonable degrees of control. They must outline learner progress in various dimensions. Grammatically, that outline must specify which structures and functions learners will have under full control, which under partial control, and which under conceptual control (as recognition)’ (p. 38). However, as yet, little information is available on appropriate benchmarks for learning German as a second language (L2). Consequently, one goal of the current investigation was to identify stages in accurate grammar use over the first four semesters in a university setting, drawing primarily on research in task-based language learning (Ellis, 2005; Kuiken & Vedder, 2008), supplemented by theoretical insights from dynamic systems research (a special issue of Language Learning edited by Ellis & Larsen-Freeman, 2009; Verspoor, Schmid, & Xu, 2012). Another goal was to explore how different task types may affect students’ performance, with a special focus on increasing our understanding of the constructs underlying complexity, accuracy and fluency (CAF), as suggested by Lambert and Kormos (2014) and Norris and Ortega (2009). Therefore, the current investigation sought to describe learners’ performance patterns of German verb morphology and syntax – based on grammatical features that emerged from the data – across the first two years of L2 learning. With these objectives in mind, we first provide an overview of research on task-based language teaching (TBLT), followed by a brief review of relevant concepts and descriptive tools used in dynamic systems research.
II Review of literature
1 The effects of task type on complexity, accuracy and fluency
A robust body of research pertains to task-based pedagogy, with special focus on the effects of task complexity on language performance – complexity, accuracy and fluency – and attention capacity (Ellis, 2005; Skehan, 1998; Skehan & Foster, 1999; Robinson, 2001). TBLT research has explored the interplay between task characteristics and learners’ allocation of attentional resources (Ellis, 2005; Housen, Kuiken & Vedder, 2012; Robinson, Cadierno & Shirai, 2009; Skehan & Foster, 2012). The Limited Attentional Capacity Model (Skehan, 1998; Skehan & Foster, 2001) views attentional resources as limited: as the cognitive demand of a task increases, there is less attentional capacity available, and meaning will be prioritized over form by the learner. In his framework for characterizing task difficulty Skehan (1998) discusses code complexity (e.g. linguistic complexity and vocabulary load), cognitive complexity (e.g. topic familiarity; information processing), and communicative stress (e.g. time pressure). In contrast, Robinson’s Multiple Attentional Resources Model – the Cognition Hypothesis (Robinson, 2005) – stipulates that ‘dimensions of cognitive task complexity belong to different attentional resource pools’ (p. 50), suggesting that increased task complexity does not necessarily lead to reduced performance. On the contrary, Robinson argues, tasks with higher structural complexity may increase accuracy, possibly due to cognitive (e.g. conceptual and procedural demands) as well as interactive factors (e.g. interactional and interactant demands). To date, results from this strand of research are mixed (e.g. Ellis & Yuan, 2004; Johnson, Mercado & Acevedo, 2012; Ong & Zhang, 2010), which may be partially due to divergent ways of measuring learner performance.
Until now the main analytic pillars of task-based research have been complexity, accuracy and fluency. Typically, complexity is measured in terms of coordination and subordination, accuracy in the ratio of error-free units, and fluency in the rate of speech or in the number of words written in a set amount of time (Ellis, 2005; Kuiken & Vedder, 2007). In an early task-based study, Ellis and Yuan (2004) found that L2 learners, especially at lower levels of proficiency, could not attend to meaning and accuracy simultaneously in oral tasks, requiring compromises between task complexity and accuracy. Similarly, Kuiken and Vedder (2008) analysed the writing of Dutch learners of Italian (complete beginners) and French (5–6 years of language experience) on two tasks. They measured the effects of task complexity on learners’ grammatical accuracy, which improved with more complex tasks, and on syntactic complexity and lexical variety (neither variable improved). The participants made fewer first- and second-degree errors (i.e. minor and medium-level spelling and grammatical errors that did not severely interfere with comprehensibility) on the more complex task than on the less complex one.
In a related study, Dykstra-Pruim (2003) examined the productive language abilities of students of German as a foreign language on three tasks (oral and written narration tasks, and a written grammar task). In a cross-sectional analysis she measured second- through fourth-semester learners’ knowledge of present-tense verbal morphology and syntax. She found that the oral narration task ‘more effectively discriminated conjugation abilities among semesters than did the written task or the grammar test. Meanwhile, the written task more effectively discriminated among the semesters in terms of accuracy of difficult-type word orders’ (p. 70). Furthermore, explicit grammar knowledge correlated ‘more strongly with written than with oral abilities’ (p. 73). These findings led the author to conclude that explicit grammar tests are ineffective for measuring learners’ ability to use grammar structures in communicative contexts. She also observed surprisingly large variation in performance on different grammar points at various stages of development, reiterating Larsen-Freeman’s (2006) notions of dynamic systems.
Given the often disparate findings in TBLT research, scholars have called for several methodological modifications to include more nuanced and accurate tools for analysing learner language (Lambert & Kormos, 2014; Norris & Ortega, 2009). Pallotti (2009) argues, for example, that complexity may reflect personal stylistic preferences and not signal L2 development per se. In addition, accuracy measures can be misleading because early learners can use memorized chunks of language and appear to be more accurate than intermediate and advanced learners who use language more creatively albeit less accurately (Larsen-Freeman, 2003; Myles, 2012). This is not a clear-cut case, however, as studies on the development of formulaic language (Boers, Eyckmans, Kappel, Stengers, & Demecheleer, 2006; Crossley, Salsbury, & McNamara, 2014; Serrano, Stengers, & Housen, 2015) found that learners’ accuracy on multi-word chunks correlated significantly with proficiency and learners’ exposure to the L2. They argue, therefore, that memorized formulaic chunks should not be seen as giving undue advantage to beginning L2 learners. In fact, overall, research in second language acquisition (SLA) shows a trend towards linguistic performance that becomes more complex, accurate and fluent over time (Boers et al., 2006; Gass et al., 2013; Serrano, Stengers, & Housen, 2015), albeit in dynamic ways instead of a linear fashion (Larsen-Freeman, 2006; Thewissen, 2013).
2 Theoretical insights from a dynamic systems perspective
As mentioned above, Dykstra-Pruim (2003) noted individual variability in learner performance across different grammatical structures. Research on dynamic systems has attempted to develop a framework that addresses such variability within and variation among student performance (Larsen-Freeman, 2006; Verspoor, de Bot, & Lowie, 2011). From this perspective, the L2 process is described as non-linear, iterative (where the same territory is repeatedly revisited), depending on the social context, individual psycholinguistic processes, and the task to be completed. Several large-scale dynamic systems studies have investigated the development of accuracy across proficiency levels (Thewissen, 2013; Verspoor, Schmid, & Xu, 2012) and of linguistic complexity (Vyatkina, 2012).
In a quasi-longitudinal study, Thewissen (2013) analysed essays written in English by 223 native speakers of French, German and Spanish at the intermediate and advanced levels; she found that errors fell into three main developmental patterns:
strong: when learners improved between adjacent proficiency levels; in her study, lexical and overall accuracy;
weak: when learners improved between non-adjacent levels; for example, incorrect determiners, verb morphology; and
non-progressive: when learners did not improve across levels; for example, verb tense, coordination.
Verspoor et al. (2012) explored how the theoretical assumptions of dynamic usage-based approaches could help establish objective trajectories to assess L2 learners’ written texts. In a cross-sectional study, 489 Dutch pupils learning English at the beginning and intermediate levels wrote simple, personalized essays of at most 200 words. Based on the coding of 64 measures (sentence, phrase and word level) the authors found that between levels 1 (CEFR: A1.1) and 2 (CEFR: A1.2) students’ writing advanced mostly at the word level, between levels 2 and 3 (CEFR: A2) most advances were at the syntactic level, and between levels 3 and 4 (CEFR: B1.1) advances were mostly in lexical and syntactic use.
Another large-scale study by Vyatkina (2012) on L2 writing confirmed these observations for the development of complexity. In spite of significant individual variability within and variation across participants, American learners of German showed an overall linear development in complexity in timed and untimed essay writing over the first two years of instruction. More specifically, learners wrote more complex essays regarding a general (sentence length) and a more specific complexity measure (finite verb units, subordination, and lexical variety).
To date, only a few empirical studies have been conducted that help advance the goal of developing performance benchmarks for instructed language programs. Research has assessed L2 learners’ accurate L2 use and the interplay with fluency and complexity across a variety of tasks. Yet studies have not compared language use across tasks traditionally used as performance measures, such as discrete-point and integrative writing tasks. Likewise, insights from and generalizability of task-focused studies have been limited due to a restricted operationalization of performance measures, such as degrees of errors (e.g. Kuiken & Vedder, 2008) and measures of complexity. Therefore, the current investigation adopted descriptive tools from dynamic systems research to describe learner performance across two tasks. Moreover, by conducting a multidimensional analysis of complexity, accuracy and fluency in the writing of American learners of German across two years of instructed L2 learning, the study sought to advance our agenda of establishing performance benchmarks.
III The present study
The present large-scale, cross-sectional study sought to identify patterns of variability on two task types commonly found in classroom assessment (including workbooks and testing manuals accompanying textbooks): discrete-point tasks and an integrative essay task. Focusing specifically on the production of German verb morphology, fluency and complexity across the first two years in an instructed setting, the study seeks answers to the following research questions:
1. Do learners at four points of development (semester 1 through semester 4) perform significantly differently on discrete-point items testing seven morphosyntactic features of German?
2. Do learners at four points of development (semester 1 through semester 4) perform significantly differently on an integrative writing task on seven morphosyntactic features of German?
3. Is learners’ performance on discrete-point grammar items significantly different than on an integrative writing task at each point of development on seven target morphosyntactic features of German (3a)? And, if yes, is there a significant correlation between learners’ performance under the two task conditions (3b)?
4. Is there a significant difference at four points of development on learners’ performance on the essay task in terms of fluency and complexity?
IV Methodology
1 Participants
Two hundred and forty-five American learners of German participated in this study at two public universities in the USA. At both institutions, participants comprised intact classes, each of which was taught by a different instructor. All participants signed waivers in accordance with requirements for research involving human participants. Both programs used a communicative approach to pedagogy, emphasizing language use for meaningful self-expression and authentic communication. Many textbook and workbook activities were input-based and completed by fill-in-the-blank manipulation of form, but there were also exercises that fostered reading and listening comprehension, vocabulary development and short (paragraph-length) writing. Both programs also supplemented instruction with authentic reading and listening materials, videos from the internet and movies.
While individual variation exists (see Larsen-Freeman, 2006; Thewissen, 2013; Vyatkina, 2012), the groups of students in each level could be considered relatively homogeneous, as learners had to take placement tests if they had learned German before joining the universities. The majority of students were completing a language requirement, and L2 exposure was mostly limited to the classroom and homework materials. Approximate placement on the Common European Framework scale is indicated for each group to provide a context for comparison and facilitate replication (see Table 1).
Table 1. Distribution of participants across institutions and semester.
2 Tasks and timeline
In order to examine grammatical accuracy patterns, data were elicited with two types of tasks that differed in terms of language elicited: discrete-point and integrative tasks. Since the current investigation was conducted in an ecologically valid classroom environment, participants knew that they would be graded on both tasks. The data were collected as part of the final exam in each program, which was administered at the end of the semester. All participants completed the same tasks to allow a comparison of performances across semester levels. The tasks were designed so that students from all four semester levels could complete them. Additionally, the tasks resembled tasks with which students were familiar from homework assignments and previous chapter exams. Likewise, all participants completed the discrete-point component of the test first, followed by the integrative writing task.
While the discrete-point activities do not constitute a task as understood in TBLT (Ellis, 2003), they are traditionally used in L2 programs (Jean & Simard, 2011) and are often used to determine students’ development of grammatical accuracy. The discrete-point tasks used in this study were contextualized around university life and provided obligatory contexts with only one possible answer each. Participants had to fill in the blank, complete a sentence, or put words into the correct order. Each sub-task comprised 4–8 ‘problems’.
After the discrete-point tasks, participants completed the integrative task, a short blog-entry introducing themselves. The length of the writing assignment was not stipulated, but participants were provided with prompts in English to help generate ideas (e.g. where they were from, what they studied, and how they spent their free time). While time on task was not assessed for each learner, students took on average about 15 minutes to complete the writing task.
3 Measures and data analyses
The discrete-point tasks covered more grammatical features than were used for analysis in this study. In order to allow for robust statistical cross-task comparisons, only those features that were produced by participants in both task types at all four levels were used for analysis. Thus, the focal grammar structures emerged from the data. For example, not every level used the conversational past or accusative articles in the writing task, so even though they were covered by discrete-point items, these features were not included in the analyses. The following grammatical structures emerged from the data as comparable across the two tasks:
1. conjugation of haben (‘to have’) and sein (‘to be’) as main verbs in the present tense (H/S; four fill-in-the-blank items each);
2. conjugation of regular verbs in the present tense, e.g. machen (‘to make/do’), spielen (‘to play’) (RVC; five fill-in-the-blank items);
3. conjugation of irregular verbs in the present tense, other than sein; e.g. fahren (‘to go, drive’), essen (‘to eat’) (IVC; five fill-in-the-blank items);
4. modal verb choice (MM; six sentences translated from English into German);
5. modal verb conjugation (MC; six sentences translated from English into German);
6. word order after a modal verb (MW; six sentences translated from English into German);
7. infinitive after a modal verb (MMI; six sentences translated from English into German).
Items 1–3 required learners to complete maximally two transformations to produce. Regular verbs and the verb haben (‘to have’) required one transformation: inflecting the verb to match the subject (e.g.
Research question 1: Do learners at four points of development (semester 1 through semester 4) perform significantly differently on discrete-point items testing seven morphosyntactic features of German?
The individual items in each discrete-point task received one point for correct and zero for incorrect answers. The seven target grammatical features were then compared across four levels of instruction using a series of ANOVAs (one analysis for each grammatical feature).
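The per-feature comparison described above can be illustrated with a minimal sketch of a one-way ANOVA. The function name `one_way_anova` and the scores are hypothetical and serve only to show the computation; they are not the study's data or analysis scripts, and the sketch assumes each participant contributes one percentage score per feature.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of per-group score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: group size times squared deviation of the group mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group's own mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical percent-correct scores on one feature, one list per semester
scores_by_semester = [
    [60.0, 80.0, 70.0],   # semester 1
    [70.0, 60.0, 80.0],   # semester 2
    [80.0, 90.0, 100.0],  # semester 3
    [70.0, 80.0, 90.0],   # semester 4
]
f_stat, df_b, df_w = one_way_anova(scores_by_semester)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Running the same function once per grammatical feature would mirror the series of seven analyses; obtaining a p-value additionally requires the F distribution, which is omitted here.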
Research question 2: Do learners at four points of development (semester 1 through semester 4) perform significantly differently on an integrative writing task on seven morphosyntactic features of German?
The essays were analysed for each of the seven target grammatical features by the two researchers and tallied for correct use. The researchers compared their individual coding of 10% of the data (25 sets of data were analysed by both researchers). Cohen’s kappa was calculated to determine interrater reliability, κ = .985, p < .001. For the ANOVAs, the number of correct uses was divided by the number of attempted uses of each target grammatical form, yielding a percentage score for each participant. These scores were subsequently compared across four levels of instruction using a series of ANOVAs (one analysis for each feature).
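As an illustration of the interrater check, the following sketch computes Cohen's kappa for two coders' judgements of the same items. The coding vectors are invented for the example (1 = coded as correct use, 0 = coded as incorrect use), and `cohens_kappa` is a hypothetical helper, not the software procedure used in the study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical codes."""
    n = len(rater_a)
    # Observed agreement: proportion of items coded identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions per category
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes from two researchers for ten verb forms
coder_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
coder_2 = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement (here 9 of 10 items) for the agreement expected by chance, which is why it is lower than the raw proportion.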
Research question 3: Is learners’ performance on discrete-point grammar items significantly different than on an integrative writing task at each point of development on seven target morphosyntactic features of German (3a)? And, if yes, is there a significant correlation between learners’ performance under the two task conditions (3b)?
Participants’ performance on the seven target grammatical structures was compared between the discrete-point and integrative items, using a series of t-tests. Since students did not produce the same number of regular, irregular or modal verbs in the integrative writing task, all raw scores were converted into percentages by dividing the number of accurate uses by the attempted uses of each target grammatical structure.
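The conversion to percentages and the paired comparison can be sketched as follows. All numbers are hypothetical, and the effect size shown (mean difference divided by the standard deviation of the differences) is only one of several conventions for Cohen's d in paired designs; the article does not state which variant was used.

```python
from math import sqrt
from statistics import mean, stdev

def percent_accuracy(correct, attempted):
    """Accurate uses as a percentage of attempted uses."""
    return 100.0 * correct / attempted

def paired_t(x, y):
    """Return (t, df, d) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)                      # sample SD of the paired differences
    t_stat = mean(diffs) / (sd / sqrt(n))  # paired-samples t statistic
    d = mean(diffs) / sd                   # assumed convention for Cohen's d
    return t_stat, n - 1, d

def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical (correct, attempted) tallies from four learners' essays
essay = [percent_accuracy(c, a) for c, a in [(4, 5), (9, 10), (3, 4), (5, 5)]]
# Hypothetical discrete-point percentages for the same four learners
discrete = [60.0, 80.0, 75.0, 80.0]

t_stat, df, d = paired_t(essay, discrete)
r = pearson_r(essay, discrete)
print(f"t({df}) = {t_stat:.2f}, d = {d:.2f}, r = {r:.2f}")
```

Dividing by attempted uses puts learners who wrote many verbs and learners who wrote few on the same scale, which is what makes the essay scores comparable with the fixed-length discrete-point scores.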
Research question 4: Is there a significant difference at four points of development on learners’ performance on the essay task in terms of fluency and complexity?
Following Norris and Ortega’s (2009) and Thewissen’s (2013) recommendations, each blog was evaluated for three subconstructs of complexity: (1) overall complexity was measured by sentence length; (2) subordination was measured by counting attempted use of subordinate clauses; and (3) coordination was measured by counting attempted use of coordinate clauses. Further analyses measured attempted versus accurate word order after coordinating and subordinating conjunctions (incorrect meaning was not an issue in this corpus). To determine fluency, learners’ output was measured in number of words produced (Ong & Zhang, 2010).
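A rough sketch of how such measures might be computed automatically is given below. It is an approximation only: the essays in the study were coded by the researchers, whereas this example counts conjunctions from two small, hypothetical word lists as a stand-in for clause counts (words such as denn or als are not always conjunctions). The sample text is invented.

```python
import re

# Hypothetical, non-exhaustive conjunction lists used as a proxy for clause types
COORDINATORS = {"und", "aber", "oder", "denn", "sondern"}
SUBORDINATORS = {"weil", "dass", "wenn", "ob", "obwohl", "als"}

def text_measures(text):
    """Word count (fluency), mean sentence length and conjunction counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including ä, ö, ü, ß
    lowered = [w.lower() for w in words]
    return {
        "words": len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "coordination": sum(w in COORDINATORS for w in lowered),
        "subordination": sum(w in SUBORDINATORS for w in lowered),
    }

sample = "Ich heiße Anna und ich komme aus Ohio. Ich lerne Deutsch, weil ich Musik studiere."
measures = text_measures(sample)
print(measures)
```

Separating the word count from the per-sentence average keeps fluency (how much was written) distinct from overall complexity (how long each sentence is), in line with the distinction drawn above.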
V Results
Our first research question concerned the potential developmental progression among four levels measured by learners’ performance on discrete-point items of seven morphosyntactic features of German. One-way ANOVAs revealed that learners in each semester performed at statistically similar levels on the tasks that assessed the conjugation of haben/sein (H/S) (F(3, 241) = .277, p = .842) and regular verbs (RVC) (F(3, 241) = .840, p = .473), choice of modal verb meaning (MM) (F(3, 241) = .689, p = .560), and correct word order following modal verbs (MW) (F(3, 241) = 1.80, p = .147). There seemed to be an early ceiling effect for the conjugation of regular verbs and the use of word order following modal verbs, with accuracy ranging from 86 to 92% (Figure 1). Learners’ performance on haben/sein verb conjugation (H/S) ranged between 82 and 85%, while the correct choice of modal verb meaning (MM) remained between 71 and 75% through all four levels.

Figure 1. Mean percentage of correct performance of first- through fourth-semester learners on seven discrete-point tasks.
Modal verb conjugation (MC) varied significantly across semesters, F(3, 241) = 3.41, p = .018; ω = .17. A Games–Howell post-hoc test with a 95% confidence interval revealed that first-semester (M = 67.31; [–27.70,–1.32]) and second-semester learners (M = 70.95; [–21.69, –.05]) attained significantly lower scores on this measure compared to third-semester learners (M = 81.82), but fourth-semester (M = 75.78) learners did not significantly outperform the other three levels.
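For readers who wish to relate the effect size to the test statistic, the sketch below derives ω from F and its degrees of freedom. It assumes that the reported ω is the square root of the standard omega-squared estimate for a one-way between-groups design, which the article does not state explicitly; `omega_from_f` is a hypothetical helper name.

```python
from math import sqrt

def omega_from_f(f_stat, df_between, df_within):
    """Square root of omega squared, computed from a one-way ANOVA result."""
    n_total = df_between + df_within + 1
    # omega^2 = df_b * (F - 1) / (df_b * (F - 1) + N)
    numerator = df_between * (f_stat - 1)
    omega_sq = numerator / (numerator + n_total)
    return sqrt(max(omega_sq, 0.0))  # negative estimates are truncated at zero

# Modal verb conjugation (MC): F(3, 241) = 3.41
omega_mc = omega_from_f(3.41, 3, 241)
print(round(omega_mc, 2))  # 0.17
```

Under this assumption the computed value agrees with the ω reported for modal verb conjugation.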
Likewise, the correct use of infinitives (MMI) differed significantly across semesters, F(3, 241) = 3.92, p = .009; ω = .58. A Games–Howell post-hoc test (CI 95%) revealed that second-semester learners had significantly fewer correct responses (M = 72.64) compared to third-semester (M = 86.36 [–25.66, –1.80]) and fourth-semester learners (M = 84.67 [–22.89, –1.17]). First-semester learners (M = 80.77) performed at a level of accuracy statistically similar to that of third- and fourth-semester learners, but they also did not differ significantly from second-semester learners; third- and fourth-semester learners showed no significant difference in their responses. While the ANOVA on irregular verb conjugation (IVC) only approached significance, F(3, 241) = 2.51, p = .059, performance was similarly varied across the levels (62–73% accuracy). As with modal verb conjugation and infinitives in modal constructions, performance was not linear across the levels (Figure 1): accuracy dipped for second- (M = 62.16) and fourth-semester (M = 63.47) learners as compared to first-semester (M = 71.15) and third-semester (M = 73.18) learners.
The second research question pertained to the seven target morphosyntactic features of German at the four chronological levels on the integrative writing task. The results indicated that, unlike on the discrete-point task, learner performance did not significantly vary across four semesters. This was the case for the correct use of haben/sein (H/S) (F(3, 241) = .385, p = .764), the conjugation of regular verbs (RVC) (F(3, 241) = 1.63, p = .184) and irregular verbs (IVC) (F(3, 241) = .99, p = .400), the choice of modal verb meaning (MM) (F(3, 241) = .833, p = .480), the conjugation of modal verbs (MC) (F(3, 241) = 1.10, p = .351), the correct word order following modal verbs (MW) (F(3, 241) = 2.19, p = .092), and the infinitive following a modal verb (MMI) (F(3, 241) = 1.49, p = .220). While correct use of the L2 grammatical features was very high (82–98% accuracy), there seemed to be a stronger developmental trend from first- through fourth-semester on accurate word order following a modal verb (MW) and accurate use of infinitive verbs in modal verb constructions (MMI), ranging from 76% and 79% correct use by first-semester learners (respectively) to 90% correct use by fourth-semester learners (Figure 2).

Figure 2. Mean percentage of correct performance of first- through fourth-semester learners on the integrative writing task.
The third research question compared learners’ performance on the seven discrete-point grammar tasks and the integrative writing task at each chronological level. Additionally, the analysis explored the correlation in performance on the two tasks. Paired-samples t-tests on each of the seven grammatical structures showed that, with the exception of fourth-semester learners’ conjugation of regular verbs (RVC), when performance on the discrete-point and on the integrative writing task was significantly different, grammatical accuracy was higher on the integrative writing task. Table 2 shows that participants conjugated haben and sein (H/S) significantly more accurately in the integrative writing task. This was the case for first-semester (M = 96.23 vs. M = 82.21), second-semester (M = 95.32 vs. M = 77.45), third-semester (M = 98.49 vs. M = 85.51), and fourth-semester learners (M = 96.93 vs. M = 85.33). Effect sizes were large. Only for second- and fourth-semester learners was performance on the discrete-point and integrative tasks significantly related.
Table 2. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for the conjugation of sein and haben as main verbs (H/S).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; *p < 0.05, ***p < 0.001.
As the sole exception, fourth-semester learners produced more correct regular verb conjugations on discrete-point items (M = 92.80) than on the integrative writing task (M = 86.58), as Table 3 illustrates (medium effect size). Learners’ performance on the discrete-point and integrative tasks was significantly related for all four semesters.
Table 3. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for conjugation of regular verbs (RVC).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; ***p < 0.001.
The comparison of performance scores of irregular verb conjugation (IVC) on the two tasks showed that learners used irregular verbs more accurately on the integrative writing task. This was the case for first-semester (M = 95.37 vs. 71.42), second-semester (M = 93.86 vs. M = 61.71), third-semester (M = 90.27 vs. 70.77) and fourth-semester learners (M = 90.71 vs. M = 63.01). All effect sizes were large. Only first- and fourth-semester learners’ performance on the two tasks was significantly correlated (Table 4).
Table 4. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for conjugation of irregular verbs (IVC).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; ***p < 0.001.
Likewise, participants at each level chose the correct meaning of the modal verb (MM) more often on the integrative writing task than on the discrete-point task (Table 5). This was the case for first-semester (M = 89.65 vs. 70.68), second-semester (M = 90.37 vs. M = 72.59), third-semester (M = 94.22 vs. 75.42), and fourth-semester learners (M = 96.10 vs. M = 72.22). All effect sizes were large. Performance on the two tasks was significantly correlated only for first-semester learners.
Table 5. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for correct choice of modal verbs (MM).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; ***p < 0.001.
Learners’ correct conjugation of modal verbs (MC) was significantly better on the writing task in the first (M = 89.66 vs. M = 64.94) and second semesters (M = 93.52 vs. M = 72.59), with large effect sizes; yet scores on the two tasks were not significantly related (Table 6).
Table 6. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for correct conjugation of modal verbs (MC).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; ***p < 0.001.
Accuracy of word order following the modal verb (MW) did not differ significantly between the two tasks, nor were accuracy scores on the two tasks significantly related (Table 7).
Table 7. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for correct word order following a modal verb (MW).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient.
Likewise, the correct use of an infinitive in the modal verb construction (MMI) did not differ significantly on the two tasks, although fourth-semester learners’ better performance on the integrative writing task almost reached a level of significance (M = 90.19 vs. M = 83.33), with a medium effect size (Table 8). First- and third-semester learners’ scores were significantly correlated.
Table 8. Performance comparisons (paired samples t-test) and performance correlations (Pearson’s correlation) for correct infinitive following a modal verb (MMI).
Notes. d = effect size Cohen’s d; r = Pearson’s correlation coefficient; *p < 0.05.
The fourth research question investigated potential significant differences across levels on the writing task in terms of fluency and complexity. A one-way ANOVA for fluency indicated that the blog entries became significantly longer across four semesters, F(3, 240) = 20.207, p < .001; ω = .44. A Games–Howell post-hoc test (CI 95%) revealed that first- and second-semester learners produced about the same number of words in their essays, while both third- (M = 102.52) and fourth-semester (M = 119.27) learners produced significantly more words on the integrative writing task compared to first-semester (M = 79.01; [–40.81, –6.19]/[24.20, 56.29]) and second-semester (M = 83.06; [3.06, 35.84]/[21.17, 51.22]) learners. The essays of third- and fourth-semester learners did not differ significantly in length (Figure 3).

Figure 3. Number of words produced across four semesters.
Complexity was measured with three subconstructs: sentence length, the use of subordinate clauses, and the use of coordinate clauses. Significant differences were evident for complexity as measured by sentence length (F(3, 240) = 14.746, p < .001; ω = .38). A Games–Howell post-hoc test (CI 95%) revealed that first- and second-semester learners produced similar numbers of words per sentence in their essays. Third-semester learners (M = 6.80) produced significantly longer sentences in their essays as compared to first-semester (M = 5.26; [.81, 2.25]) and second-semester learners (M = 5.68; [.41, 1.82]). Likewise, fourth-semester learners (M = 6.56) produced significantly longer sentences compared to first-semester [.66, 1.92] and second-semester [.26, 1.50] learners. Third- and fourth-semester learners’ sentence length did not differ significantly.
The number of simple sentences learners attempted (F(3, 240) = 1.011, p = .389) and the number they produced error-free (F(3, 240) = .893, p = .446) did not vary significantly across semesters. However, coordinated sentence use differed significantly across semesters, in terms of the number of sentences with coordination that learners attempted (F(3, 205) = 5.79, p = .001; ω = .25), as well as the number of sentences with coordination that learners produced accurately (F(3, 205) = 5.058, p = .002; ω = .23). A Games–Howell post-hoc test (CI 95%) revealed that fourth-semester learners (M = 3.00) attempted to produce coordinated sentences significantly more often than first-semester (M = 1.88; [.31, 1.92]) and second-semester learners (M = 2.29; [.04, 1.38]). There was no significant difference in the number of attempted coordinated sentences between first- and second-, first- and third-, and third- and fourth-semester learners. A Games–Howell post-hoc test (CI 95%) revealed that fourth-semester learners (M = 2.51) also produced significantly more accurate sentences with coordination than first-semester (M = 1.48; [.25, 1.80]) learners. Correct sentences with coordination did not differ significantly between any other groups of learners (second-semester M = 1.87; third-semester M = 2.17).
Finally, the analysis regarding the attempted use of sentences with subordination showed a significant difference between groups (F(3, 127) = 21.779, p < .001; ω = .56). A Games–Howell post-hoc test (CI 95%) revealed that first-semester learners (M = .32) attempted significantly fewer sentences with subordinate clauses than second-semester (M = 1.68; [–2.14, –.57]), third-semester (M = 2.41; [–2.81, –1.37]) or fourth-semester learners (M = 2.39; [–2.66, –1.46]). Results for the correct use of sentences with subordination also indicated significant differences, F(3, 127) = 13.230, p < .001; ω = .42. A Games–Howell post-hoc test (CI 95%) revealed that first-semester learners (M = .10) produced significantly fewer accurate sentences with subordination than second- (M = 1.09; [–1.81, –.17]), third- (M = 1.79; [–2.41, –.97]), and fourth-semester learners (M = 1.55; [–1.97, –.93]) (Figure 4).

Figure 4. Number of correctly used structures on the integrative essay task across four semesters.
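The Games–Howell comparisons reported throughout this section do not assume equal variances or equal group sizes. A minimal sketch of the pairwise quantities follows, with hypothetical data. It computes the mean difference, its standard error, the Welch-adjusted degrees of freedom, and the studentized range statistic q. The critical value of q, which is needed for a p-value or confidence interval, is not computed here and would come from a studentized range table or a statistics package.

```python
import math
from statistics import mean, variance


def games_howell_pair(a, b):
    """Return (mean difference, standard error, Welch df, q) for two groups."""
    va_n = variance(a) / len(a)  # squared standard error of group a's mean
    vb_n = variance(b) / len(b)  # squared standard error of group b's mean
    diff = mean(a) - mean(b)
    se = math.sqrt(va_n + vb_n)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va_n + vb_n) ** 2 / (
        va_n ** 2 / (len(a) - 1) + vb_n ** 2 / (len(b) - 1)
    )
    # Studentized range statistic: the Welch t scaled by sqrt(2).
    q = abs(diff) / se * math.sqrt(2)
    return diff, se, df, q


# Hypothetical counts of subordinate clauses in two groups (not study data).
first_semester = [1, 2, 3, 4, 5]
third_semester = [2, 4, 6, 8, 10]
diff, se, df, q = games_howell_pair(first_semester, third_semester)
print(f"difference = {diff:.2f}, SE = {se:.2f}, df = {df:.2f}, q = {q:.2f}")
```

With a critical value q_crit for the chosen alpha, the number of groups and df, the confidence interval for the difference is diff ± q_crit × SE / √2. An interval that excludes zero corresponds to a significant pairwise difference, which is how the bracketed intervals in the text can be read.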
Thewissen (2013) provides a useful organizational framework for presenting patterns in development across levels as strong (significant difference between adjacent proficiency levels), weak (significant difference between non-adjacent proficiency levels), and non-progressive (non-significant differences across proficiency levels). The results of this study are presented following her model in Table 9.
Table 9. Summary of results.
Notes. strong = significant difference between adjacent proficiency levels; weak = significant difference between non-adjacent proficiency levels; non-progressive = no significant differences between/among proficiency levels.
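The three labels defined in these notes can be expressed as a small decision rule. The sketch below applies only the definitions given above to pairwise results reported in the text; it is not a reproduction of Table 9. Two choices are assumptions: a measure with both adjacent and non-adjacent significant differences is labelled strong, and the direction of the difference (improvement or backsliding) is ignored.

```python
def classify_pattern(significant_pairs):
    """Label a developmental pattern from significant pairwise differences.

    significant_pairs: iterable of (level_a, level_b) tuples, where levels
    are semester numbers 1 to 4.
    """
    gaps = [abs(a - b) for a, b in significant_pairs]
    if not gaps:
        return "non-progressive"
    # Assumption: any adjacent-level difference is enough for "strong".
    if 1 in gaps:
        return "strong"
    return "weak"


# Pairwise results as reported in the text for three complexity measures.
attempted_subordination = [(1, 2), (1, 3), (1, 4)]
attempted_coordination = [(1, 4), (2, 4)]
attempted_simple_sentences = []

for name, pairs in [
    ("subordination", attempted_subordination),
    ("coordination", attempted_coordination),
    ("simple sentences", attempted_simple_sentences),
]:
    print(name, "->", classify_pattern(pairs))
```

Because the rule ignores direction, a significant drop between adjacent levels would also be labelled strong; a fuller version would record the sign of each difference separately.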
VI Discussion
Many practitioners and textbook authors work with the assumption that performance improves linearly across language levels, even if it is not borne out by research (Jean & Simard, 2011). The current investigation sought to delve further into this discussion by analysing a corpus of 23,721 words produced by language learners across four chronological levels on traditional assessment tasks found in current textbooks and attendant testing manuals: discrete-point tasks, one for each grammatical structure, and an integrative writing task.
The findings echoed other research regarding the complexities of L2 development, backsliding, and inconsistent progression (Thewissen, 2013; Verspoor et al., 2012; Vyatkina, 2012; 2013), and the study offers further confirmation that student performance depends on several interacting factors, including the task type used to elicit learner language (Robinson, 2005; Skehan, 1998) and the target grammatical structure. Comparing student performance across two task types allowed us to determine variability within individual learners, and collecting data cross-sectionally across four chronological levels resulted in a quasi-variability analysis.
The current data suggested that while students’ use of modal verbs (accurate conjugation and the use of infinitives) showed strong improvement between second and third semesters, performance patterns were also marked by backsliding in the second (use of infinitive) and fourth semesters (conjugation). Similarly, irregular verb conjugation revealed a zigzag-shaped pattern, as first- and third-semester learners performed better than their second- and fourth-semester peers.
The non-progressive and zigzag trend lines appear disappointing from a curricular perspective. Nevertheless, measures of accuracy with discrete-point items might be misleading, masking the fact that some grammatical features (conjugation of haben and sein with 82–85% accuracy, and regular verb conjugation with 88–92% accuracy) may have been acquired early, reaching a ceiling effect quickly (Thewissen, 2013), possibly a result of high-frequency input. In contrast, the zigzag developmental pattern of modal verb and irregular verb conjugation may indicate limited available input in a foreign language setting, or that these forms need to be learned as individual items at beginning levels of L2 instruction.
In the integrative writing task, none of the grammatical features exhibited improved accuracy, although four grammatical structures trended that way. Two aspects of modal verb constructions (word order and the use of infinitives) seemed to improve across first and second semesters, as did regular verb conjugation between the first and third semesters. However, modal verb conjugation decreased in accuracy between the second and fourth semesters.
Measures of accuracy may also have masked another phenomenon. Beginning learners might have used (semi-)fixed expressions or memorized chunks of language accurately (Larsen-Freeman, 2003), while similar accuracy scores may reflect more original L2 use among intermediate learners. Others (Boers et al., 2006; Crossley et al., 2014; Gass et al., 2013; Serrano et al., 2015), however, find that such an explanation is unlikely, since the ability of beginning L2 learners to use chunks accurately, in either form or meaning, is limited (accuracy correlates with proficiency and amount of exposure to the L2).
In fact, when learners’ apparent lack of accuracy on the writing task is examined in conjunction with the fluency and complexity measures, the data suggested that fluency, sentence length, and the use of coordination and subordination improved throughout the four levels, mirroring previous research (Ellis & Yuan, 2004; Vyatkina, 2012). Moreover, third- and fourth-semester students wrote significantly longer blog entries (fluency measure), which might have resulted in less time for proofreading. It is also possible, however, that the present findings support Skehan’s Limited Attentional Capacity Model (1998) in that an increase in more complex L2 use may have negatively affected accuracy.
Findings regarding significant differences in learners’ performance between the two task conditions also illustrated the complexity of developmental progress: accurate performance on one task was not consistently related to accurate performance on the other task. In fact, no pattern of related performance on the two tasks was evident in the data. Thus, factors other than knowledge of grammar use seemed to affect learners’ performance. The current data analysis suggested that participants performed better on the integrative task than on discrete-point items, countering previous findings that indicated no effects for task type (Kuiken & Vedder, 2008).
Two possible explanations emerge for these results. First, when learners focused on expressing meaning, as was the case in the integrative writing task, the message drove the language required to express it (Ellis, 2003), yielding, for example, better accuracy in modal verb constructions in the writing task (about 90%, compared to 70% on the discrete-point task). Since the writing task was personalized and did not require the integration of external perspectives, the communicative message may have been clearer in the learners’ minds, allowing them to map language onto ideas more easily. In contrast, however contextualized the discrete-point items might have been for the developers of the test, learners had to make sense of new ideas in each task and individual item. In other words, learners had to assign grammatical forms to ideas and sentences prepared by test developers instead of their own ideas in the writing task. The lack of readily apparent coherence may have resulted in learners’ having to focus on new meaning with every sentence, at the expense of accuracy. This interpretation would seem to support the Limited Attentional Capacity Model (Skehan, 1998; Skehan & Foster, 2001): as the cognitive demand of a task increased, there appeared to be less attentional capacity available, and learners prioritized meaning over form. In the present study, at least, the allegedly more complex integrative writing task seemed easier for learners to navigate, and the discrete-point items were more challenging due to a lack of context (i.e. the cognitive demand of the discrete-point sentence comprehension was higher than that of the integrative essay task).
Second, learners’ higher performance on the written task might be attributable to their ability to manage carefully what forms they used (mostly first and third person), whereas the discrete-point items required all verb forms. As Larsen-Freeman (2003) observed, learners are masters of avoidance, and can select language forms with which they are comfortable in order to express ideas creatively and effectively, ignoring others. Thus, measures of accuracy can be misleading on integrative tasks, reflecting participants’ ability to make linguistic choices freely, rather than work within the parameters of obligatory contexts.
VII Conclusions and implications
In order to contextualize the conclusions and implications, three limitations in this study need to be acknowledged. First, the study is cross-sectional instead of longitudinal; therefore, development is inferred rather than directly observed (see Thewissen, 2013; Verspoor et al., 2011). Second, the target grammatical features emerged from the data rather than being pre-determined by the researchers, to allow for cross-task comparisons. While this is more naturalistic in one way, it also excluded certain morphosyntactic features (e.g. pronouns, various cases) from analysis. Third, in this study discrete-point items always preceded the integrative written task; a different sequence may have led to different results due to potential practice effects. In spite of these limitations, this large-scale study makes significant contributions to the understanding of L2 use at beginning and intermediate levels of proficiency, an area as yet largely neglected (Vyatkina, 2013).
Several pedagogical implications emerge from the results. First, the relatively quick acquisition of most of the grammatical features observed in this study suggests that learners’ attention could be directed sooner to other aspects of L2 learning (e.g. lexical development) than current textbooks and curricula accommodate. Second, the present findings also support Chavez’ (2013) recommendation that L2 learning and assessment in college classrooms undergo a fundamental change, moving away from a cognitively-oriented training approach to SLA to one that prioritizes meaning-making. Third, important philosophical issues are raised as textbooks continue to rely heavily on easily gradable fill-in-the-blank exercises and dehydrated sentences, greatly skewing towards sentence-level practice and assessment. Yet, such exercises might misrepresent what learners can perform, potentially under-representing their L2 abilities; they may also discourage learners from using contextual clues and from co-constructing knowledge with language. Fourth, performance measured by one type of task does not seem to discriminate effectively between grammatical abilities. Instead, multiple task types are needed to measure grammatical performance. Based on this study, the seemingly more demanding task (because it requires that learners provide all context, lexicon and linguistic features) seemed also to be the more accurate indicator of learners’ ability to express meaning.
Several implications arise for research as well. First, the present findings suggest that learners may be able to apply grammatical features more effectively in meaningful, open-ended communication. More research is needed to examine the relationship between elicitation prompts and learners’ ability to put learned material to L2 use; perhaps some of the lacking ability evidenced in previous research is due to the way we measure performance rather than learner avoidance. Crucially, the less accurate results in this study on discrete-point items may have been a function of less effective comprehension by the learners. Potential discrepancies between task-design assumptions and learner interpretation or comprehension (even of individual test items) merit further investigation.
Second, for improved understanding of L2 development in German, as well as for programmatic recalibration (Swaffar & Arens, 2005), further research is needed to benchmark various linguistic features beyond the ones analysed here. The present study identified some linguistic features over which learners can be expected to have control (e.g. conjugation of regular verbs, haben and sein) or partial control (e.g. irregular verbs) within the first year, but more details could serve as helpful guideposts. Finally, this study highlighted the limitations of typical assessment materials for measuring learners’ knowledge of grammar. Since assessment often guides, and reciprocally informs, curricular design, both need fundamental changes to meet the challenges of learners’ needs in an interconnected world.
Footnotes
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
