Abstract
Language research has provided insight into how speakers translate a thought into a sequence of sounds that ultimately becomes words, phrases, and sentences. Despite the complex stages involved in this process, relatively little is known about how we avoid and handle production and comprehension errors that would otherwise impede communication. We review current research on the mechanisms underlying monitoring and control of the language system, especially production, with particular emphasis on whether such monitoring is carried out by domain-general or domain-specific procedures.
We produce and understand our native language with little conscious effort and relatively few mistakes. This is perhaps why most research pays slight attention to how we monitor and regulate the mental processes that convert thoughts into language: Successful communication seems too easy to require much surveillance and control. The fact that speakers correct themselves, however, indicates that the language system must undergo some form of monitoring. In the current article, we argue that the need for monitoring and control stems directly from the architecture and functioning of the language system and that understanding monitoring and control operations is important for (a) a complete theory of language processing and (b) insight into language disorders that may derive from monitoring/control deficits.
Why Does Language Production Need Monitoring and Control?
Figure 1 shows a simplified schema of the word-production system (see Dell, Nozari, & Oppenheim, 2014, for a review). As can be seen, the system is highly interconnected; the same semantic feature (e.g., four legs) connects to several lexical representations (e.g., cat, dog, rat), as does the same phonological feature (e.g., /t/ to cat, rat, mat). The functional consequence of this structural connectivity is that activation of any representation causes activation of several other representations. This organization provides a natural framework for representing similarity: Cat is more conceptually similar to dog than to fog, because cat and dog share more semantic features. However, such connectivity also incurs a cost, because production requires not only activation but also selection (see Glossary), which occurs in at least two distinct stages (Dell et al., 2014). The first stage (mapping semantic features onto lexical items) ends by selecting one of several activated words. The second stage (mapping the selected lexical item onto sounds) ends by selecting relevant phonemes. Simultaneous activation of related representations interferes with the selection of the target representation at either stage (e.g., Breining, Nozari, & Rapp, 2016; Cook & Meyer, 2008; Costa, Alario, & Caramazza, 2005; Nozari, Freund, Breining, Rapp, & Gordon, 2016). Resolving such interference requires control, and to appropriately detect the need for control, a monitoring system is necessary.

Fig. 1. The schema of the two-step model of word production (Dell & O’Seaghdha, 1992). Selection happens in two stages: lexical selection at the word layer and phonological selection (encoding) at the phoneme layer. See Glossary.
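The two selection stages described above can be illustrated with a toy spreading-activation sketch. It is not the published two-step model: the feature set, the link tables (`SEMANTIC_TO_WORD`, `WORD_TO_PHONEME`), the unit weights, and the helper names `spread` and `select` are all assumptions made for illustration, and the sketch omits noise, decay, and feedback between layers.

```python
# Toy two-step sketch. Every feature, word, phoneme label, and weight
# below is an illustrative assumption, not a parameter of a published model.

SEMANTIC_TO_WORD = {
    "four_legs": ["cat", "dog", "rat"],
    "meows": ["cat"],
    "barks": ["dog"],
    "pet": ["cat", "dog"],
}

WORD_TO_PHONEME = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "o", "g"],
    "rat": ["r", "ae", "t"],
}


def spread(sources, links, weight=1.0):
    """Sum the activation each active source passes to its linked targets."""
    activation = {}
    for source, strength in sources.items():
        for target in links.get(source, []):
            activation[target] = activation.get(target, 0.0) + weight * strength
    return activation


def select(activation):
    """Return the most active unit (selection by maximum activation)."""
    return max(activation, key=activation.get)


# Stage 1: semantic features activate several words; one word is selected.
features = {"four_legs": 1.0, "meows": 1.0, "pet": 1.0}
word_activation = spread(features, SEMANTIC_TO_WORD)
target_word = select(word_activation)

# Stage 2: the selected word activates its phonemes.
phoneme_activation = spread({target_word: 1.0}, WORD_TO_PHONEME)

print(word_activation)     # cat is most active, but dog and rat are co-activated
print(target_word)
print(phoneme_activation)
```

Even in this minimal version, the shared feature "four_legs" leaves dog and rat partially active while cat is being selected, which is the kind of co-activation that the text identifies as the source of interference.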
Monitoring and Control in Language Production
Comprehension-based monitoring
In his “perceptual loop” theory of monitoring, Levelt (1989) proposed that speakers monitor their own speech the same way they monitor other people’s speech—through the comprehension system. There is evidence that this is to some extent true. For instance, blocking auditory feedback decreases the proportion of detected errors (Oomen, Postma, & Kolk, 2005). Speakers also accept illusory errors presented to them via auditory feedback as their own (Lind, Hall, Breidegard, Balkenius, & Johansson, 2014). However, two lines of work provide evidence against comprehension as the sole monitoring mechanism: (1) Event-related potentials have revealed a similar signal around the time of response generation (ERN = error-related negativity) common to both verbal and nonverbal tasks (e.g., Riès, Janssen, Dufau, Alario, & Burle, 2011), suggesting a similar monitoring mechanism. It is unlikely that language comprehension is used to detect nonlinguistic errors. Importantly, ERN is independent of conscious awareness of errors, which is hard to reconcile with a comprehension-based monitor. (2) Neuropsychological data from individuals with aphasia show a double-dissociation between the ability to understand and monitor speech, as well as a dissociation between the ability to detect different error types like semantic (“dog” for “cat”) and phonological (“hat” for “cat”) errors that do not track the individuals’ performance on semantic and phonological comprehension tests (see Nozari, Dell, & Schwartz, 2011, for a review).
Production-based monitoring
On the other hand, data from both children and brain-damaged individuals suggest that error-detection ability depends on production abilities (Hanley, Cortis, Budd, & Nozari, 2016; Nozari et al., 2011), inspiring a different class of monitoring theories called “production-based monitors” (e.g., De Smedt & Kempen, 1987). The newest version of the production-based monitors, the conflict-based monitor (Nozari et al., 2011), posits that the simultaneous activation of more than one representation in any layer of the production system (words, phonemes, etc.), which creates interference, is actually used as information to generate a “conflict signal” (Botvinick, Braver, Barch, Carter, & Cohen, 2001). This signal is relayed to a part of a general-monitoring system (likely the anterior cingulate cortex), which receives similar conflict signals from other cognitive and motor domains and generates the ERN (see Ullsperger, Fischer, Nigbur, & Endrass, 2014, for a review and alternative perspectives). Importantly, when the production system is immature or damaged, the monitoring signal is weak. Figure 2 explains the relationship between the quality of the production system and monitoring.

Fig. 2. How does the conflict-based monitor explain the relationship between the quality of the production system and the ability to detect errors? In a well-trained healthy system, correct trials have much lower conflict than error trials (Dell et al., 2014; left); thus, it is easy to distinguish correct from error trials just by relying on conflict level (large distance between the two means). In an immature, inexperienced, or damaged system, both error and correct trials have high levels of conflict. Thus, it is difficult for the monitor to distinguish between errors and correct responses based on the level of conflict (small distance between the two means). This predicts a relationship between the quality of the underlying production system and the quality of the error signal. For example, if the semantic-lexical part of the production system is damaged, say due to stroke, the patient would make more lexical-semantic errors (“dog” for “cat”) and will be poor at detecting those errors but may have no impairment in detecting his/her phonological errors (see Nozari et al., 2011, for applying signal detection theory to this conflict framework).
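The signal-detection logic of this account can be sketched numerically. The example below assumes, purely for illustration, that conflict on correct and error trials follows equal-variance Gaussian distributions and that the monitor flags an error whenever conflict exceeds a criterion placed midway between the two means. The means, standard deviation, and criteria are made-up values, and `d_prime` and `detection_rates` are hypothetical helper names.

```python
from statistics import NormalDist


def d_prime(mean_correct, mean_error, sd):
    """Distance between the two conflict distributions in SD units."""
    return (mean_error - mean_correct) / sd


def detection_rates(mean_correct, mean_error, sd, criterion):
    """Return (hit rate, false-alarm rate) for a monitor that flags an
    error whenever conflict exceeds the criterion."""
    hit = 1.0 - NormalDist(mean_error, sd).cdf(criterion)
    false_alarm = 1.0 - NormalDist(mean_correct, sd).cdf(criterion)
    return hit, false_alarm


# Assumed values: a healthy system has low conflict on correct trials;
# a damaged system has high conflict on both trial types.
healthy_d = d_prime(mean_correct=1.0, mean_error=4.0, sd=1.0)
damaged_d = d_prime(mean_correct=3.0, mean_error=4.0, sd=1.0)

healthy_hit, healthy_fa = detection_rates(1.0, 4.0, 1.0, criterion=2.5)
damaged_hit, damaged_fa = detection_rates(3.0, 4.0, 1.0, criterion=3.5)

print(f"healthy: d'={healthy_d:.1f}, hits={healthy_hit:.2f}, false alarms={healthy_fa:.2f}")
print(f"damaged: d'={damaged_d:.1f}, hits={damaged_hit:.2f}, false alarms={damaged_fa:.2f}")
```

Under these assumptions the damaged system detects fewer of its errors and raises more false alarms, simply because its two conflict distributions overlap more.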
Recall that the perceptual loop account proposed comprehension as a mechanism for monitoring language production. But comprehension itself requires a monitoring mechanism. When processing speech, listeners do not wait for sentences to complete before assigning provisional interpretations. For example, upon hearing “Put the apple on the napkin . . .” they immediately interpret “the napkin” as a goal (where the apple should go), until they hear “. . . in the box,” prompting them to revise (Spivey, Tanenhaus, Eberhard, & Sedivy, 2002). Moreover, listeners anticipate: Upon hearing “The boy will eat . . .” they rapidly search for edible objects in a scene (Altmann & Kamide, 2007). Though efficient, incremental processing has the downside of generating potential misinterpretations or wrong predictions that must rapidly be corrected via monitoring and control (Novick, Trueswell, & Thompson-Schill, 2005).
An intriguing possibility is that comprehension itself may be regulated by a mechanism akin to conflict monitoring: When late-arriving input (e.g., “. . . in the box”) suggests a different interpretation from the anticipated one, multiple representations will be simultaneously active, leading to the generation of a conflict signal. If conflict-detection mechanisms or the subsequent application of control suffer, the individual will be unable to resolve interference. Empirical evidence supports this notion: Patients who have trouble resolving conflict in recognition memory—when familiar but irrelevant stimuli excessively interfere with target recognition—also fail to revise sentence misinterpretations and make more naming errors when they must choose among several competing labels (e.g., sofa/couch/loveseat; Novick, Kan, Trueswell, & Thompson-Schill, 2009).
Forward models of monitoring
Two models are noteworthy in this category. Hickok (2012) proposed that activation of a lexical item not only activates its motor-phonological representation but also its perceptual representation (in the auditory and somatosensory cortices for syllables and phonemes, respectively). The perceptual representation reinforces the activation of the motor representation, which in turn suppresses the perceptual representation via inhibitory connections. If a different motor program is executed, the auditory representation is not suppressed and its persistent activation releases an error signal. While magnetoencephalography studies have shown rapid activation of the word’s representation in the auditory cortex during speech (e.g., Tian & Poeppel, 2010), there is not yet evidence that damage to perceptual representations impairs error detection. Moreover, the model does not explain the dissociation between the detection of semantic and phonological errors (Nozari et al., 2011). Pickering and Garrod (2013) proposed a classic forward model, in which a predicted percept generated through an efference copy of the action command is compared to an actual percept. If the two differ, an error is detected. While the distinction between motor and perceptual representations is clear at the level of phonology, it is unclear what such representations are like for higher linguistic levels such as syntax, or why duplication of such high-level representations is plausible.
Regardless of the specific mechanism through which monitoring is achieved, it serves two purposes: correcting errors already made and preventing such errors from recurring. The latter entails both long-term learning and online behavioral adjustments to optimize performance. In the nonverbal domain, performance on a conflict trial is faster and more accurate following another conflict trial compared to a nonconflict trial. For example, in the Flanker task, deciding that a central arrow (e.g., >><>>) faces left is easier if the preceding trial was <<><< versus >>>>>. Despite various explanations, all agree that this effect (often called conflict adaptation) reflects dynamic reactive control, once stimulus-repetition and learning confounds are removed (Duthoo & Notebaert, 2012). Similar adaptation patterns have emerged in production using a Picture-Word Interference (PWI) task (Freund, Gordon, & Nozari, 2016; Shitova, Roelofs, Schriefers, Bastiaansen, & Schoffelen, 2017), when individuals must name images (e.g., truck) despite a semantically distracting word superimposed on it (e.g., car). This finding suggests that, similar to nonverbal tasks, language production is regulated in real time via a monitoring-control loop.
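As a rough illustration of how conflict adaptation is quantified, the sketch below computes the conflict effect (mean response time on conflict trials minus nonconflict trials) separately for trials that follow a conflict trial and trials that follow a nonconflict trial. The response times are invented, the function name `conflict_effect_by_previous` is hypothetical, and the sketch does not remove the stimulus-repetition and learning confounds mentioned above.

```python
from statistics import mean


def conflict_effect_by_previous(trials):
    """trials: list of (is_conflict, rt_ms) tuples in presentation order.

    Returns a dict mapping the previous trial's type (True = conflict)
    to the conflict effect on the current trial, in milliseconds.
    """
    cells = {(prev, cur): [] for prev in (False, True) for cur in (False, True)}
    # Pair each trial with the one that preceded it.
    for (prev_conflict, _), (cur_conflict, rt) in zip(trials, trials[1:]):
        cells[(prev_conflict, cur_conflict)].append(rt)
    return {
        prev: mean(cells[(prev, True)]) - mean(cells[(prev, False)])
        for prev in (False, True)
    }


# Invented response times (ms) for illustration only.
trials = [
    (False, 500), (True, 620), (True, 580), (False, 520), (False, 500),
    (True, 615), (True, 585), (False, 515), (True, 625),
]

effects = conflict_effect_by_previous(trials)
adaptation = effects[False] - effects[True]

print(effects)      # conflict effect after nonconflict vs. after conflict trials
print(adaptation)   # a positive value indicates conflict adaptation
```

A smaller conflict effect after conflict trials than after nonconflict trials is the signature the text describes as dynamic reactive control.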
Language Monitoring and Control: Domain-General or Domain-Specific?
The question of whether cognitive processes are domain-general or domain-specific is important from both theoretical and applied angles. Theoretically, it addresses the issue of modularity (e.g., Fodor, 1983) and its more recent extension claiming functional isolation of the language system from the rest of cognition (e.g., Fedorenko, Behr, & Kanwisher, 2011). Clinically, it addresses the controversy surrounding the effectiveness of “Brain Training” programs (e.g., Simons et al., 2016), which claim that training basic cognitive functions improves performance on a range of everyday tasks because various cognitive systems employ the same core functions. The similarities between monitoring and control in language production, comprehension, and even nonverbal domains can be interpreted as evidence for a domain-general monitoring-control system. However, the term “domain-general” should be defined carefully. Below, we deconstruct the concept of domain-generality into three parts. We explain why domain-generality in one part does not necessarily imply domain-generality in other parts and evaluate whether language monitoring-control processes fit within a domain-general framework according to each.
Domain generality as shared computational principles
Monitoring and control are “processes” operating on “representations.” It is undisputed that representations are domain-specific: Visual and verbal information differ in physical characteristics, the sensory-motor organs that handle them, and the cortical regions that store them. However, the processes that operate on them could follow the same general rules or be system-specific. Assessed by this criterion, the conflict-based theory (Nozari et al., 2011) and Hickok’s (2012) model are both domain-general models, as they assume the same mechanism leads to error detection in any system. The comprehension-based account, conversely, is domain-specific, proposing language-monitoring through the specialized language-comprehension system.
The strongest evidence for a domain-general language-production monitor is the ERN, which, as stated earlier, is found during both verbal and nonverbal task performance (Riès et al., 2011). However, Acheson and Hagoort (2014) failed to find a reliable correlation between ERN magnitudes in a verbal tongue-twister task and a Flanker task. The authors interpreted this as evidence against domain-general monitoring. Methodological problems with correlations aside, there are theoretical problems with this interpretation (e.g., Hsu, Jaeggi, & Novick, 2017). Figure 2 explains the relationship between the quality of the system that generates responses and the quality of monitoring in that system. For the conflict signal (and ERN) to correlate in magnitude across tasks, one must assume similarly strong representations and comparable distributions of conflict in the two tasks, an assumption that is unlikely to be true. Therefore, while the domain-general conflict-based monitor expects ERN as an error signature in any task, it does not predict that ERN magnitude should necessarily correlate between tasks.
Domain generality as shared neural implementation
A different criterion for domain generality is whether a target process uses the same neural resources irrespective of task. Note that domain generality of computational principles does not necessarily imply common neural resources. For example, indirect competitive inhibition implemented through lateral inhibition (e.g., Munakata et al., 2011) posits that representations at the same layer (e.g., cat vs. dog) mutually inhibit one another, proposing a natural way for conflict to be resolved. Because these inhibitory connections are local, involvement of common brain regions across different tasks is not expected even though all systems may use lateral inhibition.
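A minimal sketch of lateral inhibition shows how purely local connections can resolve competition without any shared external controller. The update rule, the parameter values, and the function name `settle` are assumptions chosen for illustration; they are not taken from Munakata et al. (2011).

```python
def settle(inputs, inhibition=0.9, step=0.1, n_steps=200):
    """Let units in one layer compete through mutual (lateral) inhibition.

    inputs: dict mapping unit name to its constant excitatory drive.
    Each unit is excited by its own drive, decays toward zero, and is
    inhibited in proportion to the summed activation of its rivals.
    """
    activation = {name: 0.0 for name in inputs}
    for _ in range(n_steps):
        total = sum(activation.values())
        updated = {}
        for name, drive in inputs.items():
            rivals = total - activation[name]
            change = drive - activation[name] - inhibition * rivals
            # Activation cannot fall below zero.
            updated[name] = max(0.0, activation[name] + step * change)
        activation = updated
    return activation


# Assumed drives: "cat" receives slightly more input than "dog".
final = settle({"cat": 1.0, "dog": 0.8})
print(final)
```

With these assumed parameters, the unit with the stronger drive ends up highly active while its competitor is suppressed, so the conflict is resolved within the layer itself.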
Are language monitoring and control processes neurally domain general or not? fMRI studies of production tasks that require resolution of lexical-semantic conflict have shown increased activation of the left middle temporal cortex (de Zubicaray, McMahon, & Howard, 2015; de Zubicaray, Wilson, McMahon, & Muthiah, 2001), an area that stores lexical representations, compatible with local inhibition. However, non-language-specific areas are also involved in both monitoring and control of language production. Gauvin, De Baene, Brass, and Hartsuiker (2016) reported that language monitoring engages some areas common to monitoring in nonverbal tasks, including the presupplementary motor area, the anterior cingulate, and ventrolateral prefrontal cortex (VLPFC; Fig. 3). At odds with the predictions of the comprehension-based monitor, involvement of the auditory cortex was not found during monitoring. On the control side, increased VLPFC activation has been reported in production tasks when speakers must resolve competition between multiple activated responses (e.g., de Zubicaray et al., 2015; Nozari, Arnold, & Thompson-Schill, 2014) and in nonlinguistic tasks with similar demands (see Nozari & Thompson-Schill, 2015, and references therein).

Fig. 3. View of the main structures in the frontal cortex (left: lateral view; right: medial view). Areas consistently involved in monitoring and control during linguistic and nonlinguistic task performance include ventrolateral prefrontal cortex (VLPFC), anterior cingulate cortex (ACC), and pre-supplementary motor area (pre-SMA). DLPFC = dorsolateral prefrontal cortex; FP = frontopolar cortex; M1 = primary motor cortex; PMC = premotor cortex; PCC = posterior cingulate cortex; SMA = supplementary motor area.
During spoken comprehension, the same left VLPFC regions in each participant (defined by a functional localizer) are activated when the participant completes standard conflict-resolution tasks (e.g., Stroop) and interprets temporarily ambiguous sentences (e.g., January, Trueswell, & Thompson-Schill, 2009). Consistent with findings of neural overlap, multisession training studies that target conflict resolution in recognition memory, and presumably induce neuroplastic changes in the trained region, improve performance on nontrained language production and comprehension tasks that also require conflict resolution (Hussey et al., 2017). Crucially, though, skill transfer is observed only when the training and assessment tasks both require the same cognitive operation (conflict resolution); training does not generalize to other measures of executive functions (Harrison et al., 2013).
Domain generality as cross-task adjustment in control
Does conflict detection on Task X promote increased control on Task Y? Neither domain generality in computational principles nor domain generality in neural resources necessarily implies cross-task adjustment in control by itself, as different populations of neurons within the same prefrontal region can dynamically enforce control over different tasks, or different task aspects (e.g., Mante, Sussillo, Shenoy, & Newsome, 2013).
Earlier, we discussed conflict adaptation—dynamic reactive control on the current trial as a function of control needed on the previous trial. Cross-task adaptation paradigms test for transfer of control from one task to another by interleaving high- and low-conflict trials from each. Investigations into cross-task behavioral adjustments pertaining to language reveal mixed findings. Kan et al. (2013) reported that reading a syntactically ambiguous sentence (which led to temporary misinterpretation) yielded decreased conflict effects (greater control) on a subsequent Stroop task. Hsu and Novick (2016) extended this by showing that high-conflict Stroop trials improved listeners’ ability to revise misinterpretations on a subsequent trial while following temporarily ambiguous instructions. In contrast, Freund et al. (2016) found no evidence of cross-task adaptation between a language-production task (PWI) and a visuo-spatial prime-probe task. Critically, in this alternating design, 2-back adaptation was preserved, which reflects adaptation in PWI as a function of the previous PWI trial, despite no cross-adaptation (i.e., 1-back) between PWI and the prime-probe task. Thus, while one set of studies supports domain-general, cross-task adaptation, the other does not. One possibility is that language production and comprehension are monitored and controlled differently. Before drawing that conclusion, however, variations in the methodologies should be addressed first. For example, Freund et al. (2016) used 120 unique PWI targets, while Kan et al. (2013) and Hsu and Novick (2016) each used one ambiguity type on all critical trials. Further research is required to evaluate the possible effect of these differences.
To summarize, evidence from shared computational principles and shared neural underpinnings for monitoring/control of production, comprehension, and nonlinguistic performance converges on domain-general processes; but whether increased need for control in one task yields greater application of control in another remains unresolved.
Conclusion
Reviewing the architecture of the language-production system, we demonstrated why monitoring and control are necessary during language processing. We showed that a production-based monitor such as the conflict-based monitor can explain the current electrophysiological, neuroimaging, and neuropsychological data and easily extends to comprehension monitoring. The comprehension system and forward models are also likely to play a role in monitoring, but the extent and scope of their contribution requires further investigation. We also discussed evidence supporting a shared monitoring-control loop that dynamically adjusts behavior in production, comprehension, and nonverbal tasks subserved by common neural resources. Yet still unknown is whether engagement of this loop by one task fine-tunes control in another, a future direction we believe could elucidate the boundaries of domain generality of control. The answer to this question is also critical in determining whether training monitoring and control in a nonlinguistic task can improve language production in individuals with language impairment.
Glossary
Footnotes
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
This work was supported by National Science Foundation Grant 1631993 (to N. N.).
