Abstract
The concept of modularity is used to contrast the approach to working memory proposed by Truscott with the Baddeley and Hitch multicomponent model. This proposes four sub components comprising the
I was pleased to be invited to contribute to this special issue of Second Language Research since second language acquisition has played an important part in the development of the multicomponent model of working memory first proposed by Graham Hitch and myself (Baddeley and Hitch, 1974). One of the three initial components of our model, subsequently termed the phonological loop, was assumed to comprise a short-term store capable of maintaining verbal information for a few seconds, coupled with a rehearsal system based on covert or overt articulation. This initially gave a good account of laboratory-based experimental data but left open the question of what evolutionary function the loop might serve. The chance to investigate this came through access to a neuropsychological patient with a very specific deficit to this system. Our first hypothesis was that it would be necessary for language comprehension or production. We found little evidence for this except in the case of highly atypical artificially developed sentences (Vallar and Baddeley, 1984). We went on to propose that the phonological loop could be important for the initial phonologically-based acquisition of language, finding that the patient’s capacity to learn to associate pairs of words in her native language was normal, while her capacity to acquire vocabulary items in a second language was grossly impaired (Baddeley et al., 1988). We also found that disrupting the phonological loop in healthy people hindered second language vocabulary acquisition while having no effect on learning pairs of unrelated words in a native language (Papagno et al., 1991).
Later studies, largely in collaboration with Susan Gathercole, demonstrated that children identified as having specific language impairment appeared to have a deficit in their phonological loop capacity and that the acquisition of native vocabulary in healthy children was correlated with the capacity of their phonological loop, particularly as measured by their ability to repeat back polysyllabic nonwords (Gathercole and Baddeley, 1990). This and related work is summarized in two reviews, one aimed principally at psychologists (Baddeley et al., 1998) and a second for a broader range of readers (Baddeley, 2003).
While I have not had access to the bulk of the contributions to the special issue, it was suggested that I might reflect on the rather different approach to working memory taken in the article by Truscott (2017, this issue). It will be clear that our two approaches differ substantially, making a direct comparison lengthy and difficult. Instead I have chosen to focus on one feature of the alternative approach, namely its emphasis on a series of modular working memories across different modalities since this offers the opportunity of discussing the role of modularity in our own approach, something that we have not covered elsewhere. This will be followed by an update on the development of the multicomponent model over recent years, leading to an overview of the current model, which should allow the reader to compare and contrast our two approaches.
Modularity can be defined as the degree to which a system’s components may be separated and recombined. The term has been used in areas ranging from artificial intelligence to American literature and includes neuroscience, in which, over the years, it has generated a good deal of controversy, with views ranging from Lashley’s (1929) conclusion that modularity did not occur within the rodent brain at least, to the views expressed by Truscott (2017, this issue) proposing a relatively extensive degree of modularity, with each of many modules having its own working memory.
Lashley (1929) began with the assumption that the brain would be modular, and that there would be one area devoted to the storage of engrams, or memory traces. He searched for its location by systematically lesioning various areas of the rat’s brain, finally abandoning the search and concluding simply that the more brain he removed, the poorer the learning. He went on to propose two principles, namely ‘mass action’ and ‘equipotentiality’. Mass action proposes that the brain operates as a whole, a view that has stood the test of time in that there continues to be a broad association between the amount of tissue loss and capacity for new learning. The idea that all parts of the brain are equal is, however, clearly wrong, and much research in neuroscience has been concerned with identifying which parts of the brain are responsible for what function. In the case of long-term memory, for example, both rats and people clearly depend heavily on the hippocampus, while vision depends principally on posterior parts of the brain and thinking on the frontal lobes. However, whether these areas and functions could be regarded as modular is open to question and clearly depends on one’s definition of modularity.
In an attempt to tackle this issue, the philosopher Fodor (1983) proposed five features of modularity as follows:
Domain specificity: a module should not involve two separate domains such as vision and language.
Modules should be innately specified.
Modules should be neurophysiologically hard wired.
Modules should be autonomous and independent of other modules; and, finally,
Modules should not be assembled from sub processes (although a fellow philosopher argued against this latter point; Block, 1995).
While Fodor’s proposals caused considerable discussion within philosophical circles, they had relatively little impact on empirically-based investigation, where they appeared to be far too rigid and arbitrary to be useful either for empirical research or theorizing. Coltheart (1999), however, suggested that not all of Fodor’s principles should be applied to all potential modules, proposing that ‘A system is modular only when it is domain specific’, suggesting that the other proposed features should be investigated empirically. This probably reflects his own background, which was in psycholinguistics where he was one of a group of researchers making use of John Morton’s (1969) logogen model; this model does indeed take a broadly modular information-processing approach to language. Morton’s logogen model is a good example of a useful and productive use of information processing concepts within cognitive psychology. Like many similar models, it represented conceptual entities such as memory stores in visual form as boxes linked by arrows, which indicated the transfer of information from one component to another. However, the significance of Morton’s contribution depended less on its specific mode of representation than on the concepts underpinning it; for example, the empirically justified separation of temporary storage within the language input and speech output systems (Morton, 1969).
In the case of working memory, our own multicomponent model is probably the most modular. It began with what appeared to be a single module: verbal short-term memory (STM), and its relation to long-term memory (LTM). Our work began at a period when intense activity in the area of STM was being followed by a degree of disenchantment at the growing complexity of the field and apparent lack of progress, with the result that many investigators were switching to the newly emerging fields of semantic memory and the Craik and Lockhart (1972) Levels of Processing approach to LTM.
We began by attempting to ask a simple but important question, namely what function does STM serve and, in particular, does it provide a working memory. The term STM is used here atheoretically to refer to the simple storage of limited amounts of material over brief delays, in contrast to working memory, a theoretical concept that assumes an integrated system involving both temporary storage and attentional control, a system that supports a wide range of cognitive processes and tasks. Atkinson and Shiffrin (1968), who proposed the dominant model of the time, assumed a short-term store that also functioned as a working memory, not only in controlling access to LTM, but also in providing a wide range of complex processes such as selecting and operating strategic control over action. Despite its widespread acceptance, however, doubt was thrown on the Atkinson and Shiffrin model by neuropsychological evidence that patients with grossly impaired STM and an immediate digit span of only one or two items had apparently normal LTM. They also showed apparently normal language function and could operate effectively in everyday life: one as a secretary and another as a shopkeeper (Vallar and Shallice, 1990). If the short-term store was necessary for access to LTM, and served as a working memory, why were such patients not amnesic and widely intellectually impaired?
We wanted to follow up the apparent paradox of impaired STM coupled with normal general cognition, but did not have access to such patients. Instead we chose to use a dual task method to create a condition in which verbal STM was blocked by requiring the constant repetition of a novel number sequence while performing each of several tasks that were assumed to depend on a general purpose working memory. The STM burden could then be varied by gradually increasing the length of digit sequence. Our results from studies of free recall of unrelated word lists, prose memory and verbal reasoning, all suggested that performance showed little impairment from a small concurrent load, while showing a clear though by no means overwhelming disruption when concurrent memory load approached span, presumably entirely blocking verbal STM. Our results did indeed suggest some involvement of a single storage system of limited capacity since performance dropped systematically as concurrent load increased. The considerable degree of preserved performance was, however, inconsistent with a number of features of the dominant Atkinson and Shiffrin model (Baddeley and Hitch, 1974). There was clearly a need for a new model.
We resolved to keep our new model as simple as possible, opting for three components and deliberately choosing a visual representation comprising an oval and two oblongs to signal the fact that we did not regard these systems as conventional modules, but rather as broad domains for further investigation. Our first model is shown in Figure 1. It comprised the ‘central executive’, an attentionally limited system that controls the flow of information; the central executive was coupled with two subsystems: the ‘visuo-spatial sketchpad’, which processes and temporarily stores visual and spatial information, and its verbal equivalent, the ‘phonological loop’. We recognized that the model would need considerable development, and began with what we regarded as the simplest of the three systems, the phonological loop, about which a good deal was already known through earlier research on verbal STM. At about this time we were invited to submit a paper to an influential series entitled ‘Recent advances in learning and motivation’; we hesitated since we knew the model was far from complete but decided it was too good an opportunity to miss, a fortunate decision since the resulting chapter (Baddeley and Hitch, 1974) has continued to be widely cited ever since.

Figure 1. The initial three component model of working memory proposed by Baddeley and Hitch (1974).
Our approach was to treat the phonological loop as a module but to attempt to fractionate it into components within a broadly hierarchical overall framework. We began by proposing two components: a temporary phonological store and an active subvocal rehearsal process that refreshed the memory traces. Over the years we have used a range of methods to develop the approach, including using similarity within sequences to test for encoding dimension, contrasting, for example, phonological and semantic coding, and the use of word length and articulatory suppression to investigate the rehearsal process. We have applied both of these methods to patients with specific STM deficits, providing a further test of the underlying model. Others have shown that the model can be fruitfully applied to study a number of other populations including the congenitally deaf (Conrad, 1972) and, perhaps somewhat surprisingly, to the understanding of processes underpinning both lip reading and sign language (e.g. Rönnberg et al., 2009; Wilson and Emmorey, 2006). This suggests that if the phonological loop is a module, its domain is that of language rather than audition, although this in turn could be questioned by the fact that it appears to play some role in immediate memory for music (Williamson et al., 2010).
We did not assume, however, that the phonological loop would be modular in the full Fodorian sense, since we assume it to have links with other more long-term aspects of language including syntax and semantics. We assumed, furthermore, that the loop has evolved from mechanisms originally specialized in speech perception and production. We also propose that it can be broken down into its components, one involving storage and the other involving subvocal rehearsal, which in due course could themselves be analysed in more detail and linked into theories of speech perception and production respectively. This approach has proved useful, not only analytically but also in terms of its broader application to areas such as vocabulary acquisition and reading (see Baddeley et al., 1998) and indeed second language learning (for a recent survey of work in this area, see Wen et al., 2015; see also, with comments on this theoretical link, Baddeley, 2015; Cowan, 2015).
A similar process occurred in the case of the visuo-spatial sketchpad leading, however, to a rather different pattern of results. Rehearsal does not appear to involve a separate subsystem equivalent to articulation which can potentially recreate the stimulus. Instead rehearsal appears to depend on a process sometimes known as ‘refreshing’, involving sustained attention to a selected item, a process of rehearsal that also seems typical of other nonverbal modalities. Early studies showed that visuo-spatial storage could be disrupted by spatial activity, keeping a stylus in contact with a moving spot of light for example (Baddeley and Lieberman, 1980), initially resulting in the claim that the system was spatial in nature. It was, however, later demonstrated that a separable visual component could be involved, although in actual practice the visual and the spatial typically operate together (Logie, 1986; Logie et al., 1990).
In the early years, we tended to neglect the role of the central executive as being important but less tractable than the subsystems, using the concept as a place holder within the theory, one that accepted the importance of complex attentional control without attempting to study it. In the long run, the assumption of the executive as homunculus, the little man running everything, was clearly unacceptable, and we moved on to the next stage, that of trying to specify the jobs that our homunculus needed to do and then one by one to explain them.
For this, we needed a theory of attention, of which there were several; however, all unfortunately were concerned with the attentional control of perception, whereas we needed a theory of the control of action. Happily this was provided by a simple model developed by Norman and Shallice (1986), which was sufficiently innovative to make it difficult to publish in the conservative world of journal articles but which has since proved extremely valuable. The two authors had somewhat different aims: Norman was interested in explaining slips of action in everyday life, while Shallice was interested in the sometimes bizarre lapses in attentional control shown by certain patients with damage to the frontal lobes. They proposed that attention is controlled in two ways: one largely automatic and the other via what they termed the ‘supervisory attention system’ (SAS). A good example is provided by driving, in which an experienced driver arriving at his or her office having traversed a familiar route might well have no memory of the intervening drive, despite having avoided other cars, stopped at red lights, and performed complex activities that required a large number of individual decisions. This implicit control system was assumed to rely on habit patterns, together with a series of automatic processes that resolve low level conflicts, such as whether to accelerate or slow down when approaching a changing traffic light. However, if something unexpected occurs, such as a diversion due to road works, then the more attentionally demanding SAS will cut in, combining long-term knowledge with problem solving systems to work out an alternative route.
The SAS component appeared to fit neatly into the central executive role in the existing working memory framework and was promptly adopted, initially being assumed to be a purely attentional system, with temporary storage left to the broadly defined verbal and visuo-spatial subsystems (Baddeley and Logie, 1999).
Over the years, however, it became increasingly clear that our original three component model was encountering problems of two basic types. The first was the issue of just how the visuo-spatial and verbal systems interacted and, second, how the whole system interfaced with LTM. A particular problem occurred in the case of prose comprehension, a topic that had become prominent with the development by Daneman and Carpenter (1980) of the ‘working memory span’ test. Daneman and Carpenter were interested in individual differences in the comprehension of complex prose and in the role played by working memory, which they defined as a system involving the simultaneous processing and storage of small amounts of material. The working memory span task involved presenting a sequence of sentences that participants had to read out, immediately afterwards recalling the final word of each, with span typically being around three or four sentences. Despite its limited range, the test proved to be an excellent predictor of the prose comprehension component of the exam used to select US graduate students. This led to extensive replications and extensions. These included replacing the sentence reading component with a series of simple arithmetic computations (Turner and Engle, 1989), while even simpler tasks could serve the same function provided presentation was rapid enough to make them sufficiently demanding (Barrouillet et al., 2004). Furthermore, these and other variants of the working memory span task proved to be highly correlated with performance on a whole range of demanding cognitive tasks, ranging from the capacity to resist peripheral distraction to performance on the type of reasoning task used to measure general intelligence (Engle, 1996; Kyllonen and Christal, 1990). This has subsequently been extended to second language comprehension and production, which a recent meta-analysis of 79 samples totalling 3,707 participants by Linck et al. (2014) found to reflect separate contributions from the phonological and executive components of working memory.
But where did this leave the three component model? Neither of the two subsystems had anywhere near the capacity needed to perform such complex tasks, and given the assumption that the central executive had no storage capacity, how could the system possibly work? Conscious of the jibe that whenever problems arise information processing theorists simply add another box, we had for 25 years made do with three components, but felt the time had come to add a fourth, which was termed the ‘episodic buffer’ (Baddeley, 2000). This was assumed to provide an interface between the various components of working memory, and between working memory and both perception and long-term memory. Importantly, this interface was assumed to be accessible via conscious awareness. It differed from the existing subsystems in being able to hold a limited number of multidimensional representations, or episodes, and it differed from the central executive in having storage capacity. Hence, given a spoken phrase such as ‘The dying tiger’ it can bind together the various features into a coherent consciously accessible episode. This involves converting the acoustic message into an internally represented form of words, potentially having access to visuo-spatial, acoustic and semantic long-term memory, a process that we initially hypothesized would require substantial input from the central executive. I published an invited paper on this fourth component, emphasizing its role in binding together information from diverse sources into multidimensional integrated episodes (Baddeley, 2000), and was pleasantly surprised to discover that it appeared to be well received and has since been widely cited. However, if a concept such as the episodic buffer is to earn its keep, it should do more than simply provide a convenient means of explaining away otherwise awkward results, and should offer a way of asking further questions that result in coherent answers. Over the last decade this task has been taken on by Graham Hitch, Richard Allen and myself.
We began with a question concerning the crucial role of the episodic buffer in binding together information from separate streams. The initial version of the revised model had only one route to the buffer, via the central executive. We proposed to test this using dual task methods to systematically disrupt each of the components of working memory, while studying the capacity to bind together otherwise disparate sources of information. Such a binding process was widely assumed to be a major function of conscious awareness (see Baars, 1997). We chose to study the role of binding in both perception and language. In the case of perception, for example, it is assumed to be necessary for combining separate features such as shape and colour that are processed through different physiological channels into a single integrated percept, for instance redness and triangularity into a red triangle. Similarly in prose comprehension we assumed that information from heard or seen words is bound into a coherent phrase via the contribution of syntax and semantics from long-term memory.
In the interests of breadth and generality we chose to investigate binding using both visual and verbal material. In the visual case, we studied the capacity to bind colour and shape information into integrated objects such as a red triangle, fully expecting that the central executive would play a crucial role. Our hypothesis was consistently rejected over an extended series of experiments. As expected, requiring a concurrent executively demanding task impaired overall performance, but it did so to an equivalent extent for material that involved binding, such as coloured shapes, as it did for memory of the unbound components (Allen et al., 2006). A separate series of studies examined the role of working memory in binding words into coherent sentences. We found a clear advantage in immediate recall for sentence-based sequences over the same words in random order. Overall performance was reduced by tasks that disrupt working memory, but the degree of disruption was the same for both types of material, suggesting that the binding of words into meaningful phrases was not dependent on working memory, probably operating largely automatically through processes embedded in LTM.
All the evidence pointed to the episodic buffer as a valuable but passive storage system, the screen on which bound information from other sources could be made available to conscious awareness and used for planning future action. The source of the binding depended on the nature of the material, with the binding of shape and colour into coloured objects depending on the pre-working memory perceptual system, while the binding of words into phrases and sentences depends crucially on relatively automatic language processing systems (Baddeley et al., 2009).
In pursuing this line we found ourselves led into more conventional aspects of attentional research, and we have been conducting a series of studies that apply methods developed in verbal working memory to the currently very active field of visual working memory. This in turn has led to the conclusion that humans have a limited pool of attention which can be biased in one of two directions: focusing either on external perception or internal executive control, with different concurrent tasks differentially impairing each (Hu et al., 2014, 2016).
Our current version of the original tripartite working memory model is shown in Figure 2. It retains its three original components, a view supported by extensive subsequent evidence from analysis of psychometric studies (Carroll, 1993; Gathercole, 1996; Shipstead and Yonehiro, 2016). The current model is considerably more detailed and is essentially hierarchical. It attempts to capture the flow of information within the verbal and visual domains, from perception to working memory, with individual tributaries, from visual, spatial and tactile information combining and being bound into integrated visuo-spatial representations within the sketchpad, which in turn can be bound into multidimensional episodes within the buffer. Similarly, streams of auditory-verbal information can be combined with other non-auditory language-related information within the broad phonological loop domain (see Rönnberg et al., 2009), combining with semantic and syntactic systems in LTM. Both visual and phonological domains are then assumed to influence conscious awareness through the episodic buffer. The episodic buffer in turn uses this information to feed back and control perceptual input and to combine with information from LTM to plan, control and execute future action.

Figure 2. The current elaboration of the original Baddeley and Hitch model.
The role of long-term memory is not reflected in this figure but can be summarized in Figure 3, which illustrates a simple assumption, namely that working memory interposes between cognition and action, providing a means to understand the current situation and plan for the future. Needless to say, the size of working memory in Figure 3 does not represent its importance relative to the rest of cognition.

Figure 3. The proposed relationship between working memory, broader areas of cognition and action.
Before concluding, I should say a little about the role of modularity in other theories of working memory. While a range of theories exists, many are focused on either STM or attention. Theories that consider both are less common. The most prominent of these is Nelson Cowan’s (1988) ‘embedded processes model’, which conceptualizes working memory as an activated portion of long-term memory. This view emphasizes an attentionally limited focus of attention surrounded by other recently attended items, with much of Cowan’s work attempting to specify the capacity of this focus, concluding that it comprises about four items.
Superficially this model seems very different from our own but, in fact, the difference is one of emphasis rather than substance. Our multicomponent model began with the analysis of memory span, initially focusing on accounting for evidence from verbal STM and its disturbance in neuropsychological patients, only gradually broadening to consider the attentional executive. In contrast, Cowan’s initial interest was in developmental psychology and attention. His concept of the focus of attention however maps readily onto our concept of a central executive linked to the episodic buffer. Our main difference of opinion concerns the question of whether information from long-term memory and perception is downloaded into a separate system, the buffer, or whether working memory simply operates, as suggested by Cowan, on the ‘addresses’ of the relevant stimuli in LTM. For Cowan, ‘activated LTM’ is not regarded as offering an explanation, but is rather a placeholder, noting that many other things need to be explained. It thus plays a similar role to that played by the homuncular central executive in the early versions of our own working memory model. Cowan has in fact carried out important work that is directly relevant to the concept of a phonological loop (e.g. Cowan et al., 1992), and he and I agree that our two models are in fact almost entirely compatible.
Other broad theories of working memory such as those proposed by Engle (Engle et al., 1999) and Miyake (Miyake et al., 2000) also focus on the central role of attentional control in working memory but accept the need for visual and verbal STM systems. They vary in the degree of emphasis that they place on different aspects of working memory, the role of processes such as inhibition, and the role of links to LTM. Yet other approaches, such as that of Oberauer (2010), attempt to provide a more detailed account of the processes and mechanisms underpinning working memory using tools based on computer simulation and mathematical modelling, yielding an ambitious attempt at an overall theory, although how successfully remains to be seen. I would therefore see Figures 2 and 3 as offering a broad but useful sketch map of the territory that we and others are attempting to explain in detail.
So is working memory modular? Certainly not in the full Fodorian sense, but it can, I believe, usefully be regarded as comprising a number of interacting systems that vary in their degree of modularity. There is a good reason for this: the brain is strongly interconnected, but if everything simply linked to everything else, processing would be chaotic. Processes that are closely linked, for example auditory input and spoken input, are likely to be relatively closely connected anatomically, but also require more extended links, for example, to semantics and to a source of overall executive control. However, rather than focusing on their degree of modularity, it is probably more valuable to investigate these various links, noting the extent to which they are integrated within a relatively encapsulated subsystem, on the one hand, while bearing in mind links downstream to perception and upstream to complex cognition and executive control.
Footnotes
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
