Abstract
Adults demonstrate difficulty and pronounced variability when developing second language (L2) grammatical knowledge and reading skills. We examine explanations in terms of individual differences in working memory (WM). Despite numerous studies, the association between WM and adult L2 acquisition remains unclear, and longitudinal studies are scarce and contradictory. This study investigates whether WM affects L2 grammar and reading development in beginning classroom learners, using WM tests with (Waters and Caplan’s 1996 test) and without (Daneman and Carpenter’s 1980 test) a demanding processing task. In Experiment 1, 82 beginning first language (L1) English learners of Spanish completed Daneman and Carpenter’s test, and grammar and reading pretests and posttests one year apart. In Experiment 2, 330 beginning English learners of Spanish completed the same tests as in Experiment 1 and Waters and Caplan’s test. The results reveal that only Waters and Caplan’s test (response time, recall span) yielded WM effects, and that response time (processing) negatively correlated with recall span (storage). These findings reveal longitudinal WM effects on L2 grammar and reading development at early acquisition stages, support resource-sharing WM models, and urge scholars to adopt WM tests with a processing task performed under timed conditions, and to analyse response time.
I Introduction
More than 40 years of research since Baddeley and Hitch’s 1974 seminal proposal to distinguish between short-term and working memory (WM) provide ample evidence that WM modulates a variety of complex cognitive processes, including reasoning, problem solving, planning, abstraction, and first language comprehension and production (Logie and Cowan, 2015). However, despite L2 processing requiring additional computation and activation (Ardila, 2003; Dornič, 1980; Hasegawa et al., 2002), the role of WM in adult second language (L2) acquisition remains unclear (for reviews, see Ardila, 2015; Juffs and Harrington, 2011; Linck et al., 2014; Sagarra, 2012).
Although L2 studies on WM have steadily increased in the last three decades, only a few have investigated WM effects on adult L2 acquisition in a within-participants longitudinal design, and they have generated opposite findings (WM effects: Faretta-Stutenberg, 2014; LaBrozzi, 2011; Linck and Weiss, 2011; no WM effects: Frost et al., 2013; Sagarra, 2000). Because these studies employ different research designs, scoring methods, statistical analyses, proficiency levels, linguistic tasks, and WM tests, it is virtually impossible to disentangle what factors yield the discrepant results.
This study investigates WM effects on L2 grammar and reading development in adult beginning classroom learners, via a large sample pool, a within-participants pretest–posttest design over long time periods (Experiment 1: one year; Experiment 2: one semester), and three WM measures: (1) recall span of a WM test with a self-paced easy processing task (Daneman and Carpenter’s test, DC henceforth); (2) recall span of a WM test with a difficult processing task performed under time pressure (Waters and Caplan’s 1996 test, WC henceforth); and (3) response time of the latter WM test.
The comparison of (1) and (2) is critical to evaluate the impact of processing demands on information maintenance: do more difficult processing tasks result in reduced recall span? The association between (2) and (3) is key to understanding the relationship between processing and storage: do recall span and response time yield similar results, and do processing and storage depend on shared or on independent resources? The answers to these questions will advance our understanding of current WM tests and will provide recommendations on what measures future L2 studies should employ to investigate WM capacity.
1 Working memory
WM is the activation and maintenance of short-lived memory items while performing sometimes complex and time-consuming cognitive tasks (Barrouillet and Camos, 2007). According to resource theories (Navon, 1984), humans have a limited capacity for performing mental work, and performance depends on the amount of resources available and on task demands, with more complex tasks resulting in poorer performance. Within this framework, WM models have moved from structure-oriented approaches (domain-specific) concerned with WM organization to process-oriented approaches (domain-free) centered on WM functioning (Engle and Oransky, 1999; for a review, see Cowan, 2015).
Domain-specific approaches hold that limitations in WM capacity constrain L2 learning. They are divided into single-resource models, which assume a trade-off between processing and storage (Case, 1985; Just and Carpenter, 1992), and multiple-resource models, which – like domain-free models – argue that processing and storage are independent from each other (e.g. Baddeley, 2003, 2007, 2015; Baddeley and Hitch, 1974; Baddeley and Logie, 1999). Both single- and multiple-resource models coincide on the existence of a central processor that filters the input, allocates attention to the filtered input, switches between tasks, and determines what knowledge is activated in long-term memory. The multifaceted nature of these executive functions has led some scholars to separate this central processor into several resources (Miyake et al., 2000; Waters and Caplan, 1996). In contrast, domain-free approaches do not make a distinction between processing and storage and conceive WM as the activated part of long-term memory rather than as a set of cognitive structures (Cowan, 1995, 1999; Engle et al., 1992, 1999; Lovett et al., 1999; Schneider and Detweiller, 1987).
Importantly, both domain-specific and domain-free approaches conflate cognitive load with task complexity. In other words, it is the number of items to be simultaneously retained that ultimately makes a WM test difficult, so the more demanding the processing component, the better the WM test. This proposal of a time-related decay of memory traces constitutes the core of Towse and Hitch’s model (1995; Hitch et al., 2001; Towse et al., 1998, 2000, 2002). According to this model, when participants switch their attention from storage to processing when performing WM tests, they are unable to activate stored items until attention to processing ends. Thus, the duration of the processing determines recall performance and test difficulty (see also Halford et al., 1994). In these models, time affects storage because stored items are forgotten over time through decay, but time does not affect WM functioning.
In contrast, Barrouillet et al. (2004) propose that the resource-sharing that occurs between processing and storage is time-based. For this model, test difficulty (i.e. cognitive load) does not depend on its complexity or on the duration of the processing activity, but on ‘the proportion of time during which an activity captures attention in such a way that the refreshment of memory traces or any other activity that requires attention is impeded’ (Barrouillet and Camos, 2007: 63). Thus, WM tests are not arduous because they are complex but because they encompass time pressure. Therefore, a WM test with a simple processing task requiring continuous attentional focusing can be cognitively demanding.
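As a rough illustration of this time-based view, cognitive load can be expressed as the share of the available time during which processing captures attention. The Python sketch below is not taken from Barrouillet et al. (2004); the function name and the timing values are hypothetical and serve only to show why a simple task under time pressure can be more demanding than a complex self-paced one.

```python
def cognitive_load(attention_capture_s: float, total_time_s: float) -> float:
    """Proportion of the available time during which processing captures attention."""
    if total_time_s <= 0:
        raise ValueError("total_time_s must be positive")
    if not 0 <= attention_capture_s <= total_time_s:
        raise ValueError("attention_capture_s must lie between 0 and total_time_s")
    return attention_capture_s / total_time_s


# Hypothetical timings (in seconds), invented for illustration only.
# A complex but self-paced task leaves free time to refresh memory traces.
complex_self_paced = cognitive_load(attention_capture_s=4.0, total_time_s=10.0)
# A simple task under time pressure leaves almost no free time.
simple_time_pressured = cognitive_load(attention_capture_s=4.5, total_time_s=5.0)

print(complex_self_paced, simple_time_pressured)
```

Under these invented values, the simpler task yields the higher load because it occupies a larger proportion of the time available.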
2 Working memory tests
Reading span is the most widely used index of WM span (Towse et al., 2008). In its original version (DC’s 1980 test), the reading span test asked participants to read sets of separate sentences aloud at their own pace while remembering the final word of each sentence in the set. This test is problematic because its processing task is not sufficiently demanding: self-paced WM tests require difficult processing tasks to generate enough time pressure (Lépine, Barrouillet, and Camos, 2005), and reading aloud can be an easy task if comprehension is not involved (as when reading aloud in public without attending to the meaning). In 1996, Waters and Caplan modified the test to ensure a taxing processing activity, by asking participants to make plausibility judgments after each sentence under time pressure. This change generated three scores: recall span, logicality judgment, and response time. There is a growing body of research confirming the importance of timing measures both to predict WM span (see Section I.1), and to offer an insight into WM functioning impossible to obtain via recall measures (e.g. Cowan et al., 1994; Tehan and Lalor, 2000). Despite this evidence, most L2 studies only report recall data. This study addresses this limitation by analysing both recall span and response time.
3 Cross-sectional studies on WM and L2 grammar and reading abilities
Artificial language studies show a robust connection between WM and L2 grammar learning (e.g. Kempe et al., 2010; Martin and Ellis, 2012; Misyak and Christiansen, 2012). However, more ecologically valid natural language studies have produced mixed findings (for a review, see Alptekin and Erçetin, 2010). We explain these discrepant results in terms of task demands, L2 proficiency, and WM test’s cognitive load, with (WC’s test) or without (DC’s test) a cognitively demanding measure.
Regarding task demands, Santamaria and Sunderman (2015) found an association between WC’s test and beginners’ performance on a cognitively demanding explicit production task (filling in the blanks), but not on a simpler implicit interpretation task (listening to sentences and choosing the correct picture). Concerning the association of WM effects with L2 proficiency and type of WM test, there is a large body of evidence showing a strong relationship between L2 proficiency and WM effects on various L2 areas (for a review, see Mitchell et al., 2015). Regarding L2 grammar and reading abilities, both WC’s and DC’s tests are associated with L2 reading comprehension in low proficiency learners (WC’s test: Leeser, 2007; Walter, 2004: stronger effects in low than high proficiency learners; DC’s test: Abu-Rabia, 2003; Harrington and Sawyer, 1992), but DC’s test is not associated with L2 grammar learning (Sagarra, 2000). Importantly, DC’s association with L2 reading comprehension is probably due to the reading nature of the reading span test (testing reading through reading). In contrast, in studies with higher proficiency learners, there are no WM effects on L2 reading comprehension (DC’s test: Chun and Payne, 2004) or L2 grammar learning (WC’s test: Gilabert and Muñoz, 2010; DC’s test: Côté, 2016), regardless of the WM test used. Because studies employing a WM test with a taxing processing component in lower proficiency learners reveal WM effects, and those adopting a WM test without a demanding processing task and higher proficiency learners do not, it is unclear what causes WM effects. To untangle the effects of WM test and L2 proficiency, this study compares WM tests with and without a difficult processing activity with the same pool of low proficiency learners.
Online studies on WM and L2 morphosyntactic and syntactic processing provide further evidence that L2 proficiency modulates WM effects (Coughlin and Tremblay, 2013). Thus, studies reporting WM effects had lower proficiency learners (e.g. morphosyntax: Sagarra, 2008; syntax: Dai, 2015; Havik et al., 2009; Miyake and Friedman, 1998) than those claiming no WM effects (e.g. morphosyntax: Foote, 2011; syntax: Felser and Roberts, 2007; Juffs’ studies, summarized in Juffs and Harrington, 2011; Rodriguez, 2008). Notably, two online studies comparing different proficiency levels suggest a triple relationship between WM, L2 proficiency, and task demands: Sagarra and Herschensohn (2010) found WM effects on sensitivity to morphosyntactic violations in less than more proficient learners, and Service et al. (2002) reported stronger WM effects on L2 syntactic processing in less than more proficient learners. These studies suggest that differences in L2 proficiency are influenced by task demands: in tasks involving simple adjacent agreement within a DP (Sagarra and Herschensohn, 2010), differences are obvious at low proficiency levels (beginners vs. intermediates), whereas in those containing complex embedded relative clauses (Service et al., 2002), differences are evident at high proficiency levels (advanced vs. near-native). In sum, this section’s offline and online cross-sectional studies suggest that WM modulates L2 grammar and reading abilities and that WM effects are influenced by three factors: cognitive demands of the processing component of WM tests, L2 proficiency, and task demands. Next, we describe longitudinal studies.
4 Longitudinal studies on WM and L2 grammar and reading development
Evidence that individual differences in WM predict L2 grammar and reading development longitudinally is controversial. In the first longitudinal study, Sagarra (2000) asked beginning learners to complete DC’s test, as well as an explicit grammar pretest and posttest (multiple-choice grammar test) one year apart, and found no WM effects. Similarly, Frost et al. (2013) asked intermediate and advanced learners to complete morphological processing and semantic priming tests, a non-linguistic statistical learning task, and a verbal WM test developed by the Israeli National Institute for Testing and Evaluation, among other cognitive tasks, concluding that WM was not involved in L2 morphosyntactic processing. Importantly, (1) the WM test was non-standardized, making it impossible to determine how taxing it was, and (2) the statistical learning scores only explained 20% of variance, leaving 80% of variance unexplained.
In contrast with these studies, longitudinal studies employing standardized WM tests with a taxing processing measure and low proficiency learners report WM effects on L2 reading and grammar development. For L2 reading development, Kormos and Safar (2008) found a relationship between a backward digit span test and novice learners’ end-of-semester overall L2 competence and reading skills, but they did not have a proficiency pretest and could not determine how much learning, if any, occurred. In 2012, Linck et al. addressed this limitation by following the same beginning learners for 64 weeks, and reported an association between an operation span task and global measures of reading, speaking and listening.
For L2 grammar learning studies with explicit tasks, Linck and Weiss (2011, 2015) had beginners complete an operation span WM test, a Simon’s inhibitory control test, a metalinguistic grammar test, and a vocabulary test, and found WM, but not inhibitory control, effects on both tests. Despite the limitations of this study (mixed grammar and vocabulary scores, mixed proficiency levels ranging from first- to third-semester learners, a low sample size, mixed first languages (L1s), and a brief elapsed time of 8 weeks between the pretest and the posttest), its results mirror those of the other studies using taxing WM tests and low proficiency learners. For example, Serafini and Sanz (2016) reported a connection between an operation span test and performance on both a cognitively simple explicit task (grammaticality judgments) and a cognitively complex implicit task (elicited oral imitation) in lower, but not higher, proficiency learners over one semester. Similarly, Grey et al. (2015) did not obtain WM effects with advanced study abroad learners in a grammaticality judgment or a lexical decision task over 5 weeks, using WC’s test. WM effects on L2 grammar learning have also been obtained with implicit tasks. Thus, Sanz et al. (2014) found a relationship between a listening span test adapted from WC’s test and ab initio learners’ performance over two weeks in a cognitively simple implicit task (reading or listening to a sentence and choosing one of two pictures) following meaning-focused practice with explicit feedback, but not in a less cognitively demanding explicit task (grammaticality judgment task), or treatment (grammar lesson with explicit feedback).
L2 morphosyntactic processing studies also reveal WM effects with both explicit and implicit tasks. First, Faretta-Stutenberg (2014) asked at-home and abroad low intermediate learners to complete an operation span test and an event-related potential cognitively simple explicit task (grammaticality judgments), and found that WM accounted for gains in overall proficiency sensitivity to morphosyntactic agreement violations in the abroad group, but not in the stay-at-home group. Second, LaBrozzi (2011) had at-home and abroad low intermediate learners complete WC’s test and a cognitively simple implicit eye-tracking task (reading sentences with comprehension questions) one semester apart, and found WM effects in the study-abroad learners.
To summarize, longitudinal studies employing a WM test with a taxing processing component and low proficiency learners reveal WM effects on L2 reading development (Kormos and Safar, 2008; Linck et al., 2012), L2 grammar learning with explicit (Linck and Weiss, 2011, 2015; Serafini and Sanz, 2016) and implicit tasks (Sanz et al., 2014), and L2 morphosyntactic processing with explicit (Faretta-Stutenberg, 2014) and implicit tasks (LaBrozzi, 2011). In contrast, longitudinal studies containing a non-taxing WM test and low proficiency learners (explicit task: Sagarra, 2000), or a taxing WM test and high proficiency learners (explicit task: Grey et al., 2015; Serafini and Sanz, 2016; implicit task: Frost et al., 2013) show no WM effects on L2 grammar learning. Notably, Sanz et al. (2014) found no WM effects on L2 grammar learning with a taxing WM test, a low proficiency level, and an explicit task, probably because the task was very simple and might have been too easy even for beginning learners.
The results of these longitudinal studies mirror those of the cross-sectional studies, and, taken together, demonstrate that it is WM test’s cognitive load, L2 proficiency, and task demands that account for variability in WM effects on L2 processing of morphosyntax and syntax, grammar learning, and reading comprehension (see Figure 1). It cannot be task explicitness that accounts for it (as proposed by scholars such as Engle, 2002; Reber et al., 1991; Roberts, 2012), because there are WM effects and no WM effects with both explicit and implicit tasks.

Figure 1. Summary of cross-sectional and longitudinal studies on working memory (WM) and second language (L2) grammar and reading abilities.
II The study
As shown in the literature review, research on WM and second language acquisition (SLA) is abundant but it is still unclear whether WM predicts SLA longitudinally. To shed light onto this issue, this study investigates whether WM impacts L2 grammar learning and reading comprehension longitudinally in beginning classroom learners, based on (1) recall span of a WM test lacking a taxing processing task (DC’s test), (2) recall span of a WM test with a demanding processing task (WC’s test), and (3) response time of the latter test. Also, this study examines whether processing (response time) and storage (recall span) are associated. We predict that both recall span and response time will predict grammar and reading development in WM tests with, but not without, a demanding processing task. This hypothesis is based on research revealing longitudinal WM effects on L2 grammar learning (Linck and Weiss, 2011; Sanz et al., 2014; Serafini and Sanz, 2016) and L2 reading development (Kormos and Safar, 2008; Linck et al., 2012) with WM tests with an arduous processing activity, but no effects with a WM test lacking a taxing processing component (Sagarra, 2000). Finally, we anticipate that processing will negatively correlate with storage, following the approaches arguing for a processing-storage trade-off described in Section I.1.
We concentrate on beginners, because lower proficiency learners consume more WM resources than higher proficiency ones (e.g. Harrington and Sawyer, 1992; Mitchell et al., 2015; Roberts, 2012; Sagarra, 2012; Service et al., 2002). In turn, we focus on classroom learners and explicit tasks to assess grammar learning, because some scholars argue that WM effects are more evident in explicit tasks (e.g. Roberts, 2012). Finally, we use reading span tests, instead of non-verbal WM tests, because the former have been more widely used than the latter in SLA studies, and because we wanted to compare two tests that only differed in their processing demands.
III Experiment 1
1 Participants
The participants were L1 English learners of Spanish, enrolled in a second semester (pretests) and, later, fourth semester (posttests) of Spanish at a large North American university. Classes met four days a week for 50 minutes each day, and were conducted entirely in Spanish by instructors using the same syllabus, exams, textbook, and methodology. The initial sample pool consisted of 382 participants, but the final sample pool was reduced to 82 participants. This severe attrition was mostly caused by students not taking third- and fourth-semester Spanish consecutively, not taking fourth-semester Spanish because some university units did not require 12 credits of a foreign language, or not completing all the tests. In addition, participants had to meet these requirements: to be native speakers of English, to have only lived in monolingual communities, to have not spent more than two weeks abroad, to belong to a household where English was the only language of communication, to have begun learning Spanish after age 8 in the school, to not use Spanish outside of class apart from homework, and to be 18–30 years old, because WM and processing speed start decreasing at the age of 40 (Park et al., 2003).
2 Materials and procedure
Participants received extra credit to complete a Spanish proficiency test (a group session prior to the second semester), a language background questionnaire and a WM test (an individual session during a 10-week period in the middle of the second semester), a grammar and reading pretest (three group sessions at the end of the second semester, one week apart), and a grammar and reading posttest (three group sessions at the end of the fourth semester, one week apart). The grammar test corresponded to Sagarra (2000), and the data were included in the study for comparison purposes with both the L2 reading tests of Experiment 1, and the L2 grammar and reading tests of Experiment 2.
a Screening tests: Spanish proficiency test and language background questionnaire
Participants completed the university’s Spanish placement test, which evaluated grammar (k = 27), vocabulary (k = 21), and reading skills (three passages, k = 22) in a multiple-choice format. Although the university did not provide individual scores for privacy reasons (it was an official test), the students were placed in second-semester Spanish based on their score. In addition, inferential statistics showed no differences between low and high span learners in the pretests (for further details, see Section III.4), indicating that the learners had the same proficiency level. The language background questionnaire included questions about the participants’ L1, knowledge of other languages, exposure to other languages during childhood, age of acquisition of Spanish, experience with Spanish in high school, and use of Spanish outside of class.
b Working memory test (Daneman and Carpenter, 1980)
The test was conducted individually in the participants’ L1 because deficits in L2 knowledge could affect the results, and because WM seems to be language-independent (Osaka and Osaka, 1992; Xue et al., 2004). The test involved participants reading aloud sets of sentences and recalling the last word of each sentence after each set. The test contained 60 unrelated sentences, 13 to 16 words in length. Following DC’s test, sentence series were shown in order: five sets of two sentences, five sets of three sentences, five sets of four sentences, five sets of five sentences, and three sets of six sentences.
Participants were asked to read aloud each sentence at a fairly quick pace (the researcher told participants to speed up when needed), remember the last word, and later on recall the last words of all the sentences at the end of each set. Words could be recalled in any order, but the first recalled word could not be from the very last sentence of that set. Also, participants continued until they failed three sets of a given level. Whenever participants failed to recall a word, they were encouraged to try harder next time. As in DC’s test, the last word was covered until the participant reached the end of the sentence, to avoid the possibility of looking at the last word before reading a sentence.
c Grammar test (pretest, posttest)
This test consisted of a sentence-level (Mecartty, 1993) and a paragraph-level test (University of Iowa’s Spanish placement test) measuring Spanish grammatical knowledge, such as subject–verb agreement, gender and number agreement, demonstrative and possessive adjectives and pronouns, prepositions, and adverbs of negation. The sentence-level test contained 12 sentences, and the paragraph-level test two cloze tests with 102- to 106-word passages with missing words (3 for cloze test 1, and 7 for cloze test 2). Both tests had four options per item (multiple-choice format) and a single correct answer, and participants had unlimited time and did not leave answers blank. These tests were selected because they were based on materials and a teaching methodology similar to those used in the participants’ Spanish courses. Also, they assessed grammatical knowledge from beginning to upper-intermediate proficiency levels, a crucial factor considering that the participants were tested twice two semesters apart.
d Reading test (pretest, posttest)
One week after the placement test, participants completed a reading test. The test was taken from Mecartty’s dissertation, for the same reasons mentioned above for using her grammar test. The test consisted of a 485-word text about the origins of the potato, followed by 9 multiple-choice questions evaluating reading comprehension. In line with the grammar tests, participants were encouraged to answer all questions and to guess when unsure.
3 Scoring
For the grammar and reading tests, participants received 1 point for each correct response and zero points for each incorrect response or no response. The final grammar score was a sum of the score on the Mecartty’s test and the University of Iowa’s test. Because the tests varied in the number of items, the raw scores were converted into percentage scores. For the WM test, the highest level at which a participant was able to recall all the final words in three out of five sets was taken as a span measure. For example, if someone remembered all words for three or more sentence sets at level 2, but only remembered all the words for one set at level 3, the reading span score was 2. A person who remembered all words for three or more sets at level 4 but for only one set at level 5 would receive a score of 4. If the participant was correct on only two out of five sets, the score would be intermediate between that level and the level immediately below. For example, if someone remembered all words for three or more sets at Level 2 and all words for two sets at Level 3, the score would be 2.5.
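The scoring rules above can be sketched in Python as follows. This is only an illustration of the rules as described, not the procedure actually used to score the data: the function names are hypothetical, and the floor value assigned when not even the lowest level is passed is an assumption, since the text does not specify it.

```python
from typing import Mapping


def percentage_score(correct_items: int, total_items: int) -> float:
    """Convert a raw score (1 point per correct item) into a percentage."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * correct_items / total_items


def reading_span(perfect_sets: Mapping[int, int]) -> float:
    """Span from the number of fully recalled sets (out of five) at each level.

    perfect_sets maps a level (sentences per set) to the number of sets at
    that level in which all final words were recalled.
    """
    if not perfect_sets:
        raise ValueError("perfect_sets must not be empty")
    levels = sorted(perfect_sets)
    # Assumption: if the lowest level is not passed, the span defaults to one
    # level below it (the text does not state what happens in that case).
    span = float(levels[0] - 1)
    for level in levels:
        recalled = perfect_sets[level]
        if recalled >= 3:
            span = float(level)      # level passed: three or more perfect sets
        else:
            if recalled == 2:
                span = level - 0.5   # half credit: two perfect sets
            break                    # testing stops at the first failed level
    return span


# The three worked examples given in the text.
print(reading_span({2: 3, 3: 1}))
print(reading_span({2: 5, 3: 4, 4: 3, 5: 1}))
print(reading_span({2: 3, 3: 2}))
```

The three calls correspond to the examples in the paragraph above, whose stated scores are 2, 4, and 2.5, respectively.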
4 Results
The data were submitted to two types of statistical analyses: general linear model (GLM) and correlational analyses. Multiple analyses were adopted, heeding the inconsistency of statistical analyses observed among WM studies (Sagarra, 2012), and Linck and Weiss’ (2011) admonition to link longitudinal WM effects to differences in learning outcomes over multiple points in time instead of correlations. WM was entered as a continuous variable (individual WM scores), following Friedman and Miyake’s (2005) recommendation that researchers score the test with continuous measures (total number of words recalled or proportion of words per set averaged across all sets) and avoid the creation of arbitrary groups (e.g. high- vs. low-span).
Descriptive statistics suggested that posttest scores (grammar: M = 52.66; SD = 17.54; reading: M = 76.46; SD = 32.78) tended to be higher than pretest scores (grammar: M = 35.06; SD = 16.08; reading: M = 37.75; SD = 18.94). In addition, 82 participants took the grammar tests and had a WM mean score of 2.91 (SD = .66), and 79 completed the reading tests and had a WM mean score of 2.89 (SD = .63). The grammar and reading tests had different sample sizes because three participants were unable to complete the reading tests.
Two repeated-measures ANCOVAs were conducted with Time as a within-participant independent variable (Time 1 = pretest; Time 2 = posttest), and WM as a covariate: one for the grammar pretest and posttest scores, and one for reading pretest and posttest scores. The ANCOVAs showed a significant main effect for Time in the grammar test, F(1,80) = 8.105, p = .006, and in the reading test, F(1,77) = 4.466, p = .038. A Bonferroni post hoc test revealed that, in both tests, participants scored higher in the posttest than the pretest. Also, there was no significant main effect for WM in the grammar test, F(1,80) = .124, p = .726, or the reading test, F(1,77) = .038, p = .846, suggesting that WM differences did not play a role. And there was no significant interaction of Time × WM in the grammar test, F(1,80) = 1.169, p = .283, or the reading test, F(1,77) = .067, p = .796. Finally, bivariate correlations showed that WM did not correlate with the grammar pretest (r = .013, p = .902), the grammar posttest (r = -.076, p = .477), the reading pretest (r = .002, p = .988), or the reading posttest (r = .033, p = .768). 1
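With only two time points, the Time × WM interaction of a repeated-measures ANCOVA can be examined equivalently by regressing gain scores (posttest minus pretest) on WM: the F statistic for the slope has 1 and n − 2 degrees of freedom, which matches the F(1,80) reported for the 82 participants of the grammar test. The Python sketch below illustrates this logic together with a bivariate correlation. It is not the software used in the study, the function names are hypothetical, and the data are invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gain_on_wm_f(wm, pre, post):
    """F statistic (df = 1, n - 2) for the slope of gain scores regressed on WM.

    Undefined when the correlation is exactly +1 or -1 (division by zero).
    """
    gains = [after - before for before, after in zip(pre, post)]
    r = pearson_r(wm, gains)
    n = len(wm)
    return (r ** 2) * (n - 2) / (1 - r ** 2)


# Invented scores for four hypothetical learners (not data from the study).
wm_span = [2.0, 3.0, 4.0, 5.0]
pretest = [30.0, 35.0, 40.0, 45.0]
posttest = [31.0, 38.0, 42.0, 49.0]

print(pearson_r(wm_span, pretest))
print(gain_on_wm_f(wm_span, pretest, posttest))
```

The resulting F value would then be compared with the critical value of the F distribution for 1 and n − 2 degrees of freedom.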
5 Discussion
The results from Experiment 1 show that classroom instruction improved L2 grammar learning and reading abilities over the course of two semesters, but that WM did not modulate the improvement. In effect, all statistical analyses showed a significant main effect of Time, caused by higher scores in the posttests than the pretests, regardless of the participants’ WM span. In contrast, there was no significant interaction or correlation between WM and any of the linguistic tests, indicating no WM effects. These findings might be explicable by methodological and scoring limitations of DC’s test.
First, this test evaluates processing through self-paced reading aloud. Because there was no time pressure to process the sentences fast, and reading aloud does not ensure comprehension, the processing demands of this test may be lower than what Daneman and Carpenter intended. Second, the test’s level of difficulty increased as the test progressed and participants were warned before starting a new level. Because participants knew how many sentences they would encounter in a set before beginning the set, they could have developed mnemonic strategies to memorize the final sentence words more efficiently. In fact, when asked what strategies they used to memorize the words at the end of the WM test in an exit questionnaire, higher span participants admitted having consciously changed strategies as the level of difficulty increased. The third limitation of DC’s test lies in terminating the experiment when participants cannot recall at least three of the five sets of a given level. This generated an enormous variation in the number of sentences the participants read (some read as few as 15 sentences, while others as many as 60). Finally, another limitation resides in the restricted range of scores (2 to 5.5 points), which automatically reduced the possibility of a significant interaction or correlation between WM and the linguistic tests. This phenomenon, known as the restricted range problem, reduces the observed Pearson’s r, and can lead to an erroneous conclusion that there is a lack of criterion-related validity evidence (for more information, see Suen, 2009).
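The attenuating effect of a restricted range can be illustrated with the standard formula for direct range restriction, which assumes a linear and homoscedastic relationship between the two variables. The Python sketch below is only an illustration: the function name and the input values are hypothetical and are not estimates derived from the present data.

```python
from math import sqrt


def restricted_r(r_full: float, sd_ratio: float) -> float:
    """Expected Pearson's r after direct range restriction on the predictor.

    r_full   -- correlation in the unrestricted population
    sd_ratio -- SD of the predictor in the restricted sample divided by its
                SD in the unrestricted population (1.0 means no restriction)
    """
    if not -1.0 <= r_full <= 1.0:
        raise ValueError("r_full must lie between -1 and 1")
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    return r_full * sd_ratio / sqrt(1.0 - r_full ** 2 + (r_full ** 2) * sd_ratio ** 2)


# Hypothetical values: a population correlation of .50 observed in a sample
# whose predictor SD is half of the population SD.
print(restricted_r(0.5, 1.0))
print(restricted_r(0.5, 0.5))
```

Under these hypothetical values, halving the spread of the predictor lowers the expected correlation well below its unrestricted value, which is the pattern the restricted range problem describes.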
6 Motivation of experiment 2
In order to determine whether these methodological limitations hindered WM effects on L2 learners’ development of grammar and reading abilities, a second experiment was conducted combining DC’s test and WC’s test. WC’s test was selected because it assessed processing (plausibility questions) and storage (recall of final words of sentences), and it forced participants to process sentences and make plausibility judgments fast (the sentence disappeared after a certain amount of time, and slow responses resulted in zero points). Also, it exposed participants to all the sentences regardless of their performance on the easier sets, it had a random level of difficulty, it did not specify the level of difficulty at the beginning of sets (i.e. participants were unable to plan what strategies they would use to remember the words ahead of time), and it solved the restricted score range problem by giving scores ranging from 0 to 80.
In addition to including another WM test, Experiment 2 differed from Experiment 1 in two ways. First, the final sample size increased from 82 (Experiment 1) to 330 (Experiment 2) participants. A large sample pool is crucial in WM studies because most of the population has a medium WM span (the more participants, the higher the probability of obtaining extreme scores), and because tests measuring individual differences are shown to be 80% noise (Miyake et al., 2000). The increase in sample size was made possible by reducing the time between pretests and posttests from one year (Experiment 1, with predictably severe attrition) to eleven weeks (Experiment 2). Another important difference between the two experiments resides in WM test administration. In Experiment 1, participants terminated the test as soon as they were unable to recall the words of at least three sets of a given level, whereas in Experiment 2, participants read all sentences regardless of their performance. The WM tests were computerized and included all sentences to increase reliability and internal validity.
IV Experiment 2
1 Participants
The participants were 361 English learners of Spanish, enrolled in a third semester of Spanish at a large North American university. The university was different from the one in Experiment 1, but the two were Big Ten public institutions and used similar teaching methods, textbooks, and syllabi. In addition, the criteria to be included in the study were identical in both experiments, and the final sample pool for Experiment 2 was 330. The participants of the two experiments only differed in three aspects: (1) 100 more minutes of face-to-face time in Experiment 2 (hybrid) than 1, (2) one year (Experiment 1) vs. one semester (Experiment 2) of time elapsed between pretest and posttest observations, and (3) 82 (Experiment 1) vs. 330 (Experiment 2) participants.
2 Materials and procedure
Participants received extra credit to complete a Spanish proficiency test (before the semester), a language background questionnaire and two WM tests (beginning of the third week of course), a grammar and reading pretest (end of the third week of course), and a grammar and reading posttest (end of the fourteenth week of course). All sessions were in groups, and the pretests and the posttests were administered in one session to reduce attrition. The materials were identical to those in Experiment 1, with three exceptions.
First, the Spanish proficiency test included a vocabulary section (k = 21), a reading section with three passages (total k = 42), and a grammar section (k = 26). The proficiency test was different from that of Experiment 1, because the participants were enrolled in different universities. However, both groups studied at a Big Ten public university in North America with the same goals per proficiency level and the same number of credits required for beginners, and both groups scored at the beginning level. In addition, both proficiency tests were similar in terms of sections (grammar, vocabulary, reading), content, format (multiple choice with 4 options), length (Experiment 1: 71 items; Experiment 2: 89 items), and time (45 to 60 minutes long).
Second, in Experiment 1, learners took DC’s test individually with a researcher, read the sentences aloud, and stopped the test when they failed to recall the words of three of the five sets for a given level correctly. In contrast, in Experiment 2, learners took the test in groups, read the sentences silently, and read all 60 sentences regardless of their performance. As in Experiment 1 and in DC’s study, sentences were grouped by set size, but unlike in Experiment 1, set size order was randomized.
Third, WC’s test was added to Experiment 2, and the presentation order was counterbalanced: half of the learners took DC’s test first and half WC’s test first. In the latter, participants read 80 sentences in English silently one by one, and pressed a ‘yes’ or a ‘no’ button after each to indicate whether the sentence was semantically plausible. At the end of the set, participants were asked to recall the final word of each sentence within that set. Half of the sentences were plausible and half implausible with subject–object animacy inversion. Sentences were grouped into twenty sets of sentences, divided into five groups (span sizes 2–6 sentences) of four sets each.
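The structure of WC's test described above (five span sizes, four sets per span size) can be checked with a short sketch. The function name and the use of a seeded shuffle are illustrative assumptions about how a randomized set order might be generated; they are not taken from the original test software.

```python
import random

SPAN_SIZES = (2, 3, 4, 5, 6)  # number of sentences per set
SETS_PER_SPAN = 4             # four sets at each span size


def build_set_sizes(seed=None):
    """Return the 20 set sizes in a random order (hypothetical helper)."""
    sizes = [size for size in SPAN_SIZES for _ in range(SETS_PER_SPAN)]
    # Shuffle so that participants cannot anticipate the size of the next set.
    random.Random(seed).shuffle(sizes)
    return sizes


set_sizes = build_set_sizes(seed=0)
print(len(set_sizes), "sets,", sum(set_sizes), "sentences in total")
```

Four sets at each of the span sizes 2 to 6 give 20 sets and 4 × (2 + 3 + 4 + 5 + 6) = 80 sentences, which matches the 80 sentences and the maximum score of 80 reported for this test.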
3 Scoring
The scoring of the grammar, reading, and DC tests was identical to Experiment 1. For WC’s test, one point was given per sentence if the final word of the sentence was recalled correctly, the plausibility judgment was accurate, and its response time (RT) was (a) between 300 and 5,000 ms, and (b) within 2.5 standard deviations of the mean. Given that WM comprises the simultaneous processing and storage of information in cognitively complex tasks, sentences with a correct recall and an incorrect plausibility judgment, or a correct plausibility judgment and an incorrect recall, were excluded from the total count. RTs faster than 300 ms and slower than 5,000 ms were not included, because the average college student needs between 225 ms and 400 ms to process a single word (Rayner and Pollatsek, 1989), and because allowing learners to read for as long as they want would jeopardize the complexity of the test. Finally, the mean RTs for plausible and implausible sentences were calculated separately to control for response bias (a learner’s tendency to answer yes or no).
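The scoring rule for WC's test can be sketched as follows. This is a minimal illustration rather than the scoring script used in the study: the `Trial` fields and function names are hypothetical, and the sketch assumes that the mean and standard deviation used for trimming are computed over a single participant's RTs that already fall between 300 and 5,000 ms, a detail the text does not specify.

```python
from dataclasses import dataclass
from statistics import mean, stdev

RT_MIN_MS = 300    # RTs faster than this are excluded
RT_MAX_MS = 5000   # RTs slower than this are excluded
SD_LIMIT = 2.5     # trimming criterion, in standard deviations


@dataclass
class Trial:
    recalled: bool          # final word recalled correctly
    judged_correctly: bool  # plausibility judgment accurate
    plausible: bool         # sentence type
    rt_ms: float            # response time of the plausibility judgment


def valid_rt_trials(trials):
    """Keep trials inside the absolute RT window and within 2.5 SD of the mean.

    Assumption: mean and SD come from this participant's in-window RTs.
    """
    in_window = [t for t in trials if RT_MIN_MS <= t.rt_ms <= RT_MAX_MS]
    if len(in_window) < 2:
        return in_window
    m = mean(t.rt_ms for t in in_window)
    sd = stdev(t.rt_ms for t in in_window)
    return [t for t in in_window if abs(t.rt_ms - m) <= SD_LIMIT * sd]


def recall_score(trials):
    """One point per sentence with correct recall, correct judgment, valid RT."""
    return sum(1 for t in valid_rt_trials(trials)
               if t.recalled and t.judged_correctly)


def mean_rts_by_type(trials):
    """Mean RT of correct judgments, separately for each sentence type."""
    valid = [t for t in valid_rt_trials(trials) if t.judged_correctly]
    result = {}
    for label, flag in (("plausible", True), ("implausible", False)):
        rts = [t.rt_ms for t in valid if t.plausible == flag]
        result[label] = mean(rts) if rts else None
    return result
```

In this sketch, a sentence earns a point only when storage (recall) and processing (judgment) are both successful, which mirrors the rationale that WM involves the two simultaneously.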
4 Results
To address the first two research questions (whether DC’s and WC’s recall span predict grammar and reading development, and whether reading span tests with and without a difficult processing task yield distinct outcomes), a series of bivariate correlations and four repeated-measures ANCOVAs were conducted. The four ANCOVAs were conducted with Time as a within-participant independent variable, and WM as a covariate: two for DC’s test (one for the grammar test, one for the reading test), and two for WC’s test (one for the grammar pretest and posttest scores, one for the reading pretest and posttest scores). To address the third research question (whether response time predicts grammar and reading development), two additional ANCOVAs were carried out with response time. Finally, to address the fourth research question (whether there is a processing-storage trade-off), correlations were conducted between recall span and response time scores.
Descriptive statistics showed that the 330 participants had a WM mean score of 4.15 (SD = 1.01) for DC’s test (maximum score = 5.5) and 42.49 (SD = 10.18) for WC’s test (k = 80; maximum score = 80). Also, posttest scores (grammar: M = 48.76; SD = 12.86; reading: M = 55.17; SD = 18.95) were numerically higher than pretest scores (grammar: M = 31.68; SD = 11.21; reading: M = 35.11; SD = 16.17).
a Analyses for DC’s test recall scores (storage; this test lacks a processing score)
The two ANCOVAs with DC’s test showed a significant main effect for Time in the grammar test, F(1,328) = 27.997, p = .001, partial η2 = .078, and the reading test, F(1,328) = 33.088, p = .001, partial η2 = .069, and Bonferroni post hoc tests revealed that participants scored higher in the posttests than the pretests (all, p < .001). However, there was no significant main effect for WM in the grammar test, F(1,328) = 1.870, p = .172, partial η2 = .006, or the reading test, F(1,328) = .767, p = .382, partial η2 = .000, suggesting that low and high span participants were alike. Finally, there was no significant interaction of Time × WM in the grammar test, F(1,328) = 2.145, p = .144, partial η2 = .006, or the reading test, F(1,328) = .841, p = .360, partial η2 = .002, demonstrating that this WM test did not predict the development of L2 grammar or reading. The correlational analyses corroborate these findings: bivariate correlations showed that WM did not correlate with any mean test score or learning gain score. These findings indicate that DC’s test does not predict grammar learning or reading longitudinally.
b Analyses for WC’s test recall scores (storage)
The two ANCOVAs with WC’s test recall scores showed a significant main effect for Time in the grammar test, F(1,328) = 1.976, p = .001, partial η2 = .052, and the reading test, F(1,328) = 6.786, p = .001, partial η2 = .056, and Bonferroni post hoc tests revealed that participants were better in the posttests than the pretests (all, p < .001). These results are in line with the findings obtained for DC’s test. Also, there was a significant main effect for WM in the grammar test, F(1,328) = 19.214, p = .001, partial η2 = .055, and the reading test, F(1,328) = 6.786, p = .010, partial η2 = .030. Most importantly, there was a significant interaction of Time × WM in the grammar test, F(1,328) = 6.101, p = .014, partial η2 = .014, and the reading test, F(1,328) = 5.525, p = .019, partial η2 = .006. Notably, the partial η2 values for the two interactions show a small effect size, meaning that the significant p value was not due to having a large sample pool. In addition, there were significant positive correlations between WM and the grammar posttest (r = .254, p = .001), the grammar learning gain score (r = .135, p = .014), and the reading posttest (r = .209, p = .001). Finally, DC and WC tests correlated positively (r = .181, p = .001). These findings show that DC and WC tests measure verbal working memory, but that only the latter predicts L2 grammar and reading development. These results suggest that reading span tests with and without a cognitively demanding processing measure yield distinct outcomes.
c Analyses for WC’s test response time scores (processing)
To discern between single- and multiple-resource WM models, two additional ANCOVAs were performed on the mean RTs of the correct plausibility judgments. The ANCOVAs revealed a significant main effect for Time (grammar: F(1,328) = 90.384, p = .001, partial η2 = .036, partial η2 = .052; reading: F(1,328) = 28.624, p = .102, partial η2 = .000) (posttests > pretests), a significant main effect for WM (grammar: F(1,328) = 51.221, p = .033, partial η2 = .003; reading: F(1,328) = 22.852, p = .001, partial η2 = .001) (higher span > lower span), and a significant interaction of WM × Time (grammar: F(1,328) = 23.932, p = .001, partial η2 = .001; reading: F(1,328) = 3.932, p = .005, partial η2 = .016) (higher span yielded greater scores in the linguistic tests). Again, it is noteworthy that the partial η2 values for the two interactions show a small effect size, indicating that the significant interactions were not generated by a large sample size. In the same vein, mean RTs negatively correlated with the grammar posttest (r = -.418, p = .001), the grammar learning gain scores (r = -.228, p = .001), the reading posttest (r = -.256, p = .001), and the reading learning gain scores (r = -.117, p = .034). Critically, word recall (storage) negatively correlated with mean RTs (processing) (r = -.158, p = .004). These findings suggest that WC’s RTs predict L2 grammar learning and reading longitudinally, and that processing and storage depend on and compete for a shared pool of resources, in line with single-resource models (Just and Carpenter, 1992) (see General Discussion for more information).
5 Discussion
Experiment 2 indicates that WM tests with a taxing processing component (WC’s test) can predict L2 grammar and reading development in beginning classroom learners over the course of one semester, as measured by both recall (storage) and response time (processing) scores. In contrast, WM tests without a taxing processing measure (DC’s test) fail to make these predictions. These results indicate that the cognitive load of the WM test’s processing component can affect the outcomes, and urge researchers to use WM tests with a taxing processing element. Finally, word recall (storage) negatively correlated with mean RTs (processing), in line with Towse et al. (2000, 2002) and single-resource WM models (Just and Carpenter, 1992).
V General discussion
This study has examined longitudinal WM effects on L2 grammar and reading development in beginning classroom learners, using (1) WM tests with (WC’s test) and without (DC’s test) a demanding processing task, and (2) processing (response time) and storage (recall span) measures. In line with our predictions, the results reveal that (1) only WM tests with a processing component performed under time pressure yield longitudinal WM effects, (2) both processing and storage scores produce longitudinal WM effects, and (3) processing and storage depend on and compete for a shared pool of resources.
For the first research question, whether WM tests with and without a cognitively demanding processing task generate distinct outcomes, the results show that WC’s test yielded WM effects on L2 grammar and reading development over the course of one semester (Experiment 2), but DC’s test produced no WM effects over one year (Experiment 1) or one semester (Experiment 2). These findings are in line with L2 studies reporting: (1) WM effects with WM tests with a taxing processing measure (reading: Kormos and Safar, 2008; Leeser, 2007; Linck et al., 2012; Walter, 2004; grammar: Linck and Weiss, 2011, 2015; Santamaria and Sunderman, 2015; Sanz et al., 2014; Serafini and Sanz, 2016), (2) WM effects with WM tests without a taxing processing measure but with low proficiency learners (reading: Abu-Rabia, 2003; Harrington and Sawyer, 1992), (3) no WM effects with WM tests without a taxing processing measure (reading: Chun and Payne, 2004; grammar: Sagarra, 2000; Côté, 2016), and (4) no WM effects with WM tests with a taxing processing measure but with either high proficiency learners (grammar: Frost et al., 2013; Gilabert and Muñoz, 2010; Grey et al., 2015; Serafini and Sanz, 2016) or cognitively simple tasks (grammar: Sanz et al., 2014; Santamaria and Sunderman, 2015).
Taken together, the results of this study and of previous research lead us to propose that the presence of a taxing WM processing component, L2 proficiency, and task demands explain the discrepant findings typical of the WM L2 literature with reading and listening span tests. We base our L2 proficiency argument on previous studies showing that lower proficiency learners consume more WM resources than higher proficiency learners (see also Harrington and Sawyer, 1992; Mitchell et al., 2015; Roberts, 2012; Sagarra, 2012; Serafini and Sanz, 2016; Service et al., 2002), and that WM plays a more prominent role in lower than higher proficiency learners (see studies listed in Figure 1). There are only two studies reporting WM effects with advanced learners (Dussias and Piñar, 2010; Hopp, 2014), but the former used a reading span test administered in the participants’ L2, and the two employed cognitively demanding tasks.
Regarding task demands, some scholars claim that task explicitness (operationalized as ‘a metalinguistic task directing [the learners’] attention to the manipulation at the same time as comprehending the input’; Roberts, 2012: 172) is a determining factor for WM effects (Engle, 2002; Reber et al., 1991; Roberts, 2012; Williams, 2015). In contrast, we argue that it is task demands, rather than task explicitness, that influence WM effects. First, any task asking learners to pay simultaneous attention to multiple stimuli is cognitively demanding, regardless of whether the task is explicit or not. Second, Ando et al.’s (1992) findings that high WM learners benefit more from explicit instruction and low WM learners from implicit instruction suggest that implicit tasks consume more cognitive resources than explicit ones (versus the proposal that explicit tasks are more demanding than implicit ones). Third, if task explicitness were crucial for WM effects, studies with explicit tasks would yield WM effects and those with implicit tasks would not. It is true that there is evidence of WM effects on L2 grammar and reading abilities with explicit tasks (e.g. Faretta-Stutenberg, 2014; Linck and Weiss, 2015; Serafini and Sanz, 2016), no WM effects with implicit tasks (e.g. Kaufman et al., 2010), and WM effects with explicit but not implicit tasks (Santamaria and Sunderman, 2015; Tagarelli et al., 2015). But there is also evidence of WM effects with implicit tasks (e.g. Dussias and Piñar, 2010; Havik et al., 2009; Hopp, 2014; LaBrozzi, 2011; Miyake and Friedman, 1998), no WM effects with explicit tasks (Sagarra, 2000; Frost et al., 2013; Juffs’s studies; Wright, 2010), WM effects with both implicit and explicit tasks (Serafini and Sanz, 2016), and even WM effects with implicit but not explicit tasks (Sanz et al., 2014) (exactly the opposite of Roberts’s proposal).
The present study offers additional evidence that task explicitness is not responsible for WM effects. If task explicitness alone explained WM effects, we would have obtained WM effects (following Roberts, 2012) or no WM effects (following Sanz et al., 2014) on both DC’s and WC’s tests. However, WM effects were only obtained with WC’s test, although both WM tests were compared to the same explicit task. This demonstrates that the type of WM test (with or without a taxing processing task) was the decisive factor in our study. Notably, all the explicit-task studies reporting WM effects have employed the operation span test. Our study separated the effects of task explicitness and operation span test by using an explicit task with a reading span test.
For the second research question, whether processing (response time) and storage (recall span) yield WM effects, the findings revealed that both scores predicted longitudinal WM effects on L2 grammar and reading development. In line with other scholars (e.g. Barrouillet and Camos, 2007; Towse and Hitch, 1995), we suggest that the chronometry of recall is crucial to understand WM functioning, and recommend the use of multiple performance indices beyond the total number of items that can simultaneously be remembered. Response time is an ideal additional measure because it can be obtained automatically from regular dual-task WM tests, without having to conduct additional tests. Moreover, response time can offer a unique insight into memory function and development difficult to obtain with other measures (e.g. Cowan et al., 1998; Tehan and Lalor, 2000).
In particular, response time constitutes a critical component of two WM models. The two models agree that the cognitive load of a WM processing task does not depend on complexity (e.g. number of items to be recalled in a set), but on time. For Towse and Hitch’s model (1995; Hitch et al., 2001; Towse et al., 1998, 2000, 2002), time means the total duration of the processing task: while we are executing the processing task, we cannot actively maintain stored items; we need to wait to finish the processing task. In contrast, for Barrouillet et al.’s model (2004; Barrouillet and Camos, 2007), what matters is the proportion of time we devote to the processing task: longer processing tasks have fewer pauses to divert attention rapidly from processing to retrieval and ‘memory refreshment’, and are detrimental for recall. If this is true, instructors should avoid engaging L2 learners in long processing tasks, such as understanding long sentences.
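The difference between the two time-based accounts can be made concrete with a toy calculation. The sketch below is only an illustration of the contrast drawn above: the function names and the millisecond values are hypothetical, and neither function is presented as the formal definition used by either model.

```python
def total_processing_time(episodes_ms):
    """Total duration of the processing task: the quantity emphasized in the
    description of Towse and Hitch's account."""
    return sum(episodes_ms)


def processing_proportion(episodes_ms, interval_ms):
    """Share of the interval spent on processing: the quantity emphasized in
    the description of Barrouillet et al.'s account."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return sum(episodes_ms) / interval_ms


# Two hypothetical tasks with identical processing episodes (in ms) but
# different amounts of free time between them for refreshing stored words.
dense_task = {"episodes_ms": [800, 800, 800], "interval_ms": 3000}
sparse_task = {"episodes_ms": [800, 800, 800], "interval_ms": 6000}

for name, task in (("dense", dense_task), ("sparse", sparse_task)):
    total = total_processing_time(task["episodes_ms"])
    share = processing_proportion(task["episodes_ms"], task["interval_ms"])
    print(f"{name}: total processing = {total} ms, proportion = {share:.2f}")
```

In this example both tasks involve the same total processing time, so a duration-based account would treat them alike, whereas a proportion-based account would expect poorer recall in the dense task, where less of the interval is left for refreshing memory traces.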
Finally, the third research question explored whether processing and storage draw on shared or independent resources. The results show a negative correlation between processing and storage for all grammar and reading tests, suggesting that the two components depend on and compete for a shared pool of resources. These findings align with resource-sharing models that claim that WM limits are either complexity-based (Case, 1985; Just and Carpenter, 1992), or time-based (Barrouillet and Camos, 2007; Towse and Hitch, 1995).
It is important to note that these findings do not necessarily contradict multiple-resource models (processing and storage are dissociated; e.g. Baddeley, 2007; Waters and Caplan, 1996). This is so because the WM effects obtained in this study are limited to adult beginning classroom learners completing a metalinguistic grammar task and a reading comprehension task. Higher L2 processing levels, naturalistic learning contexts, implicit tasks, processing online measures, and non-linguistic WM tests completed in a time period longer than one semester could generate different findings. Future longitudinal research examining this issue will be better able to delve into the real contribution of WM to L2 learning and to provide a more nuanced understanding of what factors explain the pronounced variation in adult L2 acquisition.
VI Conclusions
We investigated longitudinal effects of WM on the development of L2 grammar and reading in adult beginning classroom learners, over one year (Experiment 1) and one semester (Experiment 2). The results confirm a distinction between WM reading span tests: DC’s test lacked a cognitively demanding processing task and yielded no effects, whereas WC’s test included a taxing processing measure and was positively related to L2 learning of grammar and reading. Furthermore, the WM effects obtained depended on both processing and storage, and processing negatively correlated with storage. These results demonstrate that WM has a long-term effect on L2 development of grammar and reading in beginning learners, that this effect disappears in WM tests without a processing measure, and that processing and storage depend on and compete for a shared pool of resources, in support of resource-sharing WM models (Barrouillet and Camos, 2007; Case, 1985; Just and Carpenter, 1992; Towse and Hitch, 1995).
Footnotes
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
