Abstract
Recent work has begun to focus on the role that individual differences in executive function and intelligence play in the production of fluent speech. However, isolating the underlying causes of different types of disfluency has been difficult given the speed and complexity of language production. In this study, we focused on the role of memory abilities and verbal intelligence, and we chose a task that relied heavily on memory for successful performance. Given the task demands, we hypothesised that a substantial proportion of disfluencies would be due to memory retrieval problems. We contrasted memory abilities with individual differences in verbal intelligence as previous work highlighted verbal intelligence as an important factor in disfluency production. A total of 78 participants memorised and repeated 40 syntactically complex sentences, which were recorded and coded for disfluencies. Model comparisons were carried out using hierarchical structural equation modelling. Results showed that repetitions were significantly related to verbal intelligence. Unfilled pauses and repairs, in contrast, were marginally (p < .09) related to memory abilities. The relationship in all cases was negative. Conclusions explore the link between different types of disfluency and particular problems arising in the course of production, and how individual differences inform theoretical debates in language production.
There is surprisingly little research looking at the relationship between individual differences and the fluency of language outputs. However, understanding differences associated with individual variation is crucially important as theories of language production seek to account for and model individual differences. In this study, we examined the role of executive function and intelligence on disfluency production. The goals of this study were (a) to examine the relationship between individual differences in cognitive control and fluent language outputs and (b) to take a closer look at one particular cause of covert disfluencies (i.e., pauses and repetitions). Previous research has drawn a distinction between overt and covert disfluency: overt disfluencies often provide information about the underlying cause of disruption. In contrast, covert disfluencies, such as an unfilled (or silent) pause, do not reveal anything about the nature of the problem that caused the disfluency to occur (cf. Fraundorf & Watson, 2013). Hypothesised causes of covert disfluency include lapses of attention, planning/thinking demands, lexical retrieval difficulty, and faulty monitoring (Barr, 2001; Bortfeld, Leon, Bloom, Schober, & Brennan, 2001; Clark, 1994; Clark & Fox Tree, 2002; Clark & Wasow, 1998; Garrett, 1982; Levelt, 1983; Maclay & Osgood, 1959; O’Connell & Kowal, 2005). However, it is likely that all of these “causes” affect production in some situations. Thus, there is no one-to-one mapping between problems in production and the manner in which speech is disrupted (Arnold, Tanenhaus, Altmann, & Fagnano, 2004; Fox Tree & Clark, 1997; Postma, Kolk, & Povel, 1990; Shriberg, 1996). However, from a theoretical standpoint, it is important to understand what goes wrong, when it goes wrong, and how the system recovers from a problem (Levelt, 1983). 
Answers to these questions would help differentiate competing models of production (Deese, 1984; Kempen, 1978; Levelt, 1999; Vosse & Kempen, 2000) and elucidate the monitoring mechanism associated with language production (Blackmer & Mitton, 1991; Hartsuiker & Kolk, 2001; Levelt, 1989; Nooteboom, 1980; Postma, 1997, 2000).
The remainder of the Introduction covers a diverse set of topics. We begin by reviewing models of language production and then “causes” of different types of disfluency with a view towards explaining language-based causes of disfluency and different stages of language production. We next turn to reviewing the literature on individual-based causes of disfluency (i.e., individual differences). Because the majority of work examining the influence of executive function and intelligence on disfluent speech outputs was conducted on a clinical population, we review disfluency production in attention-deficit/hyperactivity disorder (ADHD). The Introduction finishes with a broad overview of executive function and intelligence and the rationale and methodology used in this study.
Models of language production
Many theories of language production assume a stage-based approach, and arguably, the most prominent model was proposed by Levelt (1989). This model assumes three main stages (i.e., conceptualisation, formulation, and articulation) with each consisting of sub-stages (Bock & Levelt, 1994; Ferreira & Engelhardt, 2006; Fromkin, 1973; Garrett, 1975; Pickering, Branigan, & McLean, 2002). Conceptualisation is the non-linguistic thought or message that the speaker wants to convey. Formulation involves grammatical encoding, which is the process of selecting lemmas to convey the message, and in the event of multi-word utterances, assigning grammatical roles, creating a linear sequence, and placement of function words. Formulation also includes phonological encoding, which is the process of retrieving the sounds and organising them into a phonetic plan. Finally, the articulation stage is where the phonetic plans are sent to the articulators. Most models of production also assume the existence of a monitoring mechanism, which is responsible for detecting problems within pre-articulated speech plans (for reviews, see Blackmer & Mitton, 1991; Postma, 2000).
Speech errors have long been used to inform theoretical models of production (e.g., Dell, 1986; Garrett, 1975). Speech errors and overt disfluencies often provide some clue about the underlying cause of the error/disruption. For example, antonym-type repairs, such as turn left . . . right at the junction, point towards incorrect lemma selection (Dell, 1986; Dell & O’Seaghdha, 1992; Hartsuiker & Notebaert, 2010). In contrast, covert disfluencies, such as unfilled pauses, do not reveal anything about the nature of the problem that caused the disfluency to occur. (The label “covert” is used to refer to the fact that the cause of the speech disruption remains hidden, or perhaps, more specifically there is no behavioural evidence about what went wrong.) The idea that covert disfluency may be a signal to particular difficulties in production is an issue that has recently attracted some empirical attention (e.g., Hartsuiker & Notebaert, 2010). However, it is important to keep in mind that there are independent causes of disfluency (e.g., lapses of attention), which do not have anything to do with the stages or sub-stages of language production.
Causes of disfluency
With regard to conceptualisation, it has been shown that filled and unfilled pauses occur frequently in cases where the speaker experiences planning difficulty and needs more time (Clark & Wasow, 1998).¹ This was confirmed in an empirical study of constituent ordering in a production task that contrasted given versus new constituents (Arnold, Losongco, Wasow, & Ginstrom, 2000). When speakers produced fillers (i.e., experiencing production difficulty), they were more likely to first say an easy-to-produce goal-given noun phrase, followed by a more difficult theme-new noun phrase. In cases in which there were no sentence-initial fillers, speakers were less likely to produce goal-early structures. This pattern is consistent with the Minimal-Load principle (Levelt, 1981, 1982), in that speakers experiencing production difficulty were more likely to produce given constituents first while gaining additional time to plan and produce a new and more difficult constituent second. A second type of study that has made inferences about fillers as a cue to conceptualisation is referred to as “the feeling of knowing.”² In these studies, listeners were asked to judge (or rate) the knowledge state of speakers. Brennan and Williams (1995) used a question-answering paradigm and argued that uh and um were not premeditated (speaker) choices but an emergent aspect of production (cf. Clark & Fox Tree, 2002). This is because listeners did not distinguish between uhs and ums, and both led to lower (knowing) ratings than unfilled pauses (Brennan & Williams, 1995; Smith & Clark, 1993; Swerts & Krahmer, 2005).
Grammatical encoding occurs after conceptualisation and is typically assumed to consist of two sub-stages, called functional- and positional-level processing (Bock & Levelt, 1994). It has received more modelling work and empirical investigation, especially in terms of picture naming (Forster & Chambers, 1973; Levelt, 2001; Levelt, Roelofs, & Meyer, 1999; Levelt et al., 1991; Schriefers, Meyer, & Levelt, 1990). The first stage of grammatical encoding involves lemma selection. For lemma selection, there is assumed to be activation of a cohort of semantically related lemmas (Kempen & Huijbers, 1983). In phonological encoding, the sound segments for the selected lemma are retrieved (Garrett, 1975). The grammatical encoding stage of production is supported by speech errors that reveal semantic similarity with an intended word (e.g., saying “stummy” instead of “stomach” or “tummy”), and relatedly, the phonological encoding stage is supported by phonologically based speech errors (i.e., words with similar phonology that are semantically unrelated).
Perhaps the most robust evidence of lexical access difficulty leading to increased probability of disfluency comes from a study by Hartsuiker and Notebaert (2010). Those authors utilised a network description task, in which they manipulated the name agreement of objects in the network. Name agreement refers to the number of possible names a given object can be described with, and name agreement produces a marked effect on naming latencies (Griffin, 2001; Paivio, Clark, Digdon, & Bons, 1989; Severens, Van Lommel, Ratinckx, & Hartsuiker, 2005; Vitkovitch & Tyrrell, 1995). In the Hartsuiker and Notebaert (2010) study, participants had to name objects with either high- or low-name agreement. Results showed that pauses and self-corrections occurred more frequently for low-name agreement pictures compared to high-name agreement pictures (see also Schnadt & Corley, 2006). When the most frequent name required an infrequent (neuter gender) determiner, there were more self-corrections and repetitions compared to names requiring the more frequent (common gender) determiner. Determiner selection takes place after selection of the content noun, with which the determiner must agree. The distinct pattern of disfluencies (pauses vs repetitions) was interpreted as evidence of different responses to different processes within production. In this case, there was a more-or-less consistent mapping between type of difficulty and type of disfluency.
In a more recent study, Fraundorf and Watson (2013) utilised a story re-telling task. Participants were told an excerpt from Alice in Wonderland, and they were then given a series of 14 plot points which they needed to incorporate into the story that they had to “re-tell.” By analysing the places where disfluencies occurred, Fraundorf and Watson made inferences about the underlying cause(s) of distinct types of disfluency (i.e., fillers, silent pauses, and repetitions). Interestingly, they argued that different types of disfluency reflect different strategies for dealing with production difficulty. Fraundorf and Watson concluded that filled pauses were produced in response to conceptual difficulties as evidenced by their frequent occurrence at new plot points and the beginnings of utterances. Unfilled pauses occurred at similar places, but were less reliably produced at particular locations. However, they were also produced in response to other factors affecting grammatical and phonological processes (e.g., lexical frequency and first mention). There were systematic patterns with repetitions, and Fraundorf and Watson argued that these are dependent on the availability of the material to be repeated. In contrast, when errors occurred or when corrections required such things as being more informative, then the corrections occurred relatively late in the production process (i.e., phonological encoding and/or problems detected by the monitoring mechanism) (Hartsuiker & Kolk, 2001; Levelt, 1989; Postma, 2000). Having reviewed *language-based* causes of disfluency, we now turn to *person-based* causes of disfluency. The majority of the existing literature in this area has focused on a clinical population (i.e., ADHD).
Disfluency in ADHD
One motivation for this study came from a series of papers that investigated sentence production in ADHD (Engelhardt, Corley, Nigg, & Ferreira, 2010; Engelhardt, Ferreira, & Nigg, 2009, 2011; Engelhardt, Veld, Nigg, & Ferreira, 2012). In particular, these papers focused on the role of inhibitory control in sentence production, as many of the main theories of ADHD focus on deficiencies in behavioural-response inhibition (e.g., Barkley, 1997; Barkley & Murphy, 2006; Martel, Nikolas, & Nigg, 2007; Nigg, 2001; Nigg, Carr, Martel, & Henderson, 2007; Pennington & Ozonoff, 1996; Schachar, Tannock, Marriott, & Logan, 1995; Tannock & Schachar, 1996). The most robust findings with respect to inhibitory control were shown for repair disfluencies (i.e., when the speaker stops and then starts over with a new word or phrase). In these studies, participants saw two pictures and a verb and they had to produce a sentence. Individuals diagnosed with the combined subtype of ADHD (i.e., those with symptoms of both inattention and hyperactivity-impulsivity) and those with partially remitted ADHD produced more repairs compared to typically developing controls (Engelhardt et al., 2010; Engelhardt et al., 2012). Approximately two-thirds of the repair disfluencies were cases in which participants made a structural revision, that is, they switched from active to passive voice (e.g., the girl . . . . . . the bicycle was ridden by the girl), and approximately one-third showed clear evidence of a production error (e.g., the boy . . . girl had ridden the bicycle), which would be consistent with lexical selection difficulty (Shao, Meyer, & Roelofs, 2013). This result was later extended to individual differences in typically developing individuals. 
Engelhardt, Nigg, and Ferreira (2013) showed that performance on the Stroop task and stop-signal reaction time (both primarily inhibition tasks) can account for nearly one-third of the variance in repair disfluency production, and this finding held even when individual differences in intelligence and set shifting were controlled for.
Executive function and intelligence
Executive functions are often described as low-level cognitive control mechanisms that govern thoughts and actions in the service of achieving goals and monitoring performance (Burgess, 1997; Denckla, 1996; Friedman & Miyake, 2004; Logan, 1985; Miyake et al., 2000; Rabbitt, 1997; Stuss & Benson, 1986). The most commonly postulated executive functions are updating/monitoring of working memory, set shifting, and inhibition (Miyake et al., 2000, 2001). It is widely assumed that these executive functions play a role in most, if not all, cognitive processing, including language production (Roelofs, 2003). A large literature has emerged concerning how different executive functions are related to different types of intelligence. For example, Friedman et al. (2006) reported that working memory was highly related to both fluid and crystallised intelligence, and that inhibition and set shifting share less variance with intelligence (see also Baddeley & Logie, 1999). The main criticism of the executive function/intelligence work is that it is difficult to operationally define and empirically dissociate the hypothesised executive functions from one another and from more general individual difference variables, such as intelligence or processing speed (Ardila, Pineda, & Rosselli, 2000; Duncan, Johnson, Swales, & Freer, 1997; Friedman et al., 2006; Jester et al., 2009; Teuber, 1972). Executive functions and intelligence are difficult to dissociate because of shared variance, that is, these abilities tend to correlate within individuals (Friedman et al., 2007; Kline, 1991; Miyake et al., 2000). The higher the shared (as opposed to unique) variance, the more difficult it is to dissociate different constructs, and problematically, the ratio of shared-to-unique variance differs between abilities and sometimes between samples (Miyake & Friedman, 2012).
Investigations of the role of executive functioning in more complex cognitive tasks often use individual differences paradigms. The basic idea is that if an ability, such as language production, relies on a (low-level) executive function, then individuals varying in that executive function will also vary, for example, in language production performance. If there is no relationship, then correlations between tasks should be at or near zero. Research into the role of executive functioning on the fluency of language outputs interestingly began with the study of clinical populations (e.g., ADHD, autism spectrum disorder [ASD]) and individuals with deficits in particular abilities (e.g., older adults).³ More recently, however, research has shifted to typically developing individuals (e.g., Engelhardt et al., 2013; Shao, Roelofs, & Meyer, 2012).
Intelligence is widely assumed to reflect functioning across broader and wider neural networks compared to executive function. All participants taking part in the Engelhardt et al. ADHD studies were also given an assessment of full scale intelligence as part of the experimental protocol. Specifically, each participant completed five subscales from the Wechsler Adult Intelligence Scale (Wechsler, 1997a, 1997b), and so, several previously published papers reported the correlations between intelligence and various sorts of disfluency.
Current study
In this study, we were primarily interested in covert disfluencies (i.e., repetitions and unfilled pauses). Repetitions are when a speaker stops and then repeats something they just said with no functional benefit, and unfilled pauses are silent pauses in speech. Repetitions were shown to be significantly correlated with full scale intelligence (Engelhardt et al., 2010), are produced more frequently by participants with the inattentive subtype of ADHD (Engelhardt, Ferreira, & Nigg, 2011), and are significantly correlated with Stroop performance (Engelhardt et al., 2013). Unfilled pauses were shown to be significantly correlated with two of the Wechsler subscales, vocabulary and matrix reasoning, and were produced more frequently in participants with the inattentive subtype of ADHD. These correlational findings are intriguing, but at this point, it is not known how much of the variance associated with these correlations is shared and how much is unique. As mentioned previously, working memory tends to correlate highly with intelligence, whereas inhibition and set shifting do not (Friedman et al., 2006).
Covert disfluencies present a unique challenge to empirical investigation because, as mentioned above, there is no behavioural evidence concerning the nature of the underlying problem (Levelt, 1983). In this study, we were interested in the relationship between memory abilities and the rate of covert disfluency production. Our primary hypothesis was that in cases where memory retrieval was too slow to keep up with production, unfilled pauses and repetitions would be a likely outcome (similar hypotheses have been proposed for stuttering, see Howell & Au-Yeung, 1999). We employed a memorise-and-repeat sentence production task, which, as noted in earlier work (e.g., Ferreira, 1991), reduces the need for conceptual-level processing, lexical selection, and syntactic planning. Therefore, this task eliminates much of the difficult work of both conceptualisation and grammatical encoding (Bock & Levelt, 1994; Bock, 1996; Ferreira & Engelhardt, 2006; Levelt, 1989). Instead, the memorise-and-repeat production task is primarily about memory encoding, memory retrieval, and speech production.⁴ Based on previous research, we might expect covert disfluencies, in particular repetitions, to be related to verbal intelligence (e.g., Engelhardt et al., 2010). However, our primary hypothesis (given memory-associated task demands) was that a substantial number of covert disfluencies would be due to (slow) memory retrieval. One possibility is that the correlation between verbal intelligence and repetitions arises from shared variance between (verbal) intelligence and working memory ability (Miyake & Friedman, 2012). In this study, we assessed both verbal intelligence and working memory, and we also obtained two measures of executive functioning (i.e., Stroop task and Wisconsin Card Sorting task).
Methodology
We used six subtests from the Wechsler Adult Intelligence Scale (Wechsler, 1997a) to create a two-factor latent variable model (see Figure 1). Latent variable modelling has several advantages given the goals of the study and the nature of the data set. The first is that latent variables represent shared variance from multiple tasks used to tap the same underlying construct. Therefore, latent variables are less susceptible to idiosyncratic task properties (i.e., task impurity issues). The second advantage is that because measurement error is separated from the latent variable, the latent variable provides a purer measure of the constructs of interest. The third advantage is that latent variables are allowed to correlate, and thus, the shared variance between constructs is incorporated and quantifiable within the model. We employed a nested model fitting procedure to ascertain whether particular types of disfluency are associated with variability in verbal intelligence or working memory abilities. To do so, we added another observed variable (disfluency) to the model and drew pathways from each latent variable onto disfluency. We then ran hierarchical analyses in which we tested whether omitting one of these two pathways resulted in a significant decrease in model fit. If it does, then it suggests that that pathway is important and indicates substantial shared variance. In terms of sample size, it is recommended practice to have a minimum of 10 participants for each observed variable, and a minimum of three observed variables for each latent variable (Stevens, 2002).

Figure 1. Two-factor structural equation model.
Summary
In this study, participants memorised and repeated syntactically complex sentences. Each sentence contained a main and subordinate clause, and the order of clauses was reversed in half of the items (Christianson, Hollingworth, Halliwell, & Ferreira, 2001; Ferreira, Christianson, & Hollingworth, 2001; for examples, see Table 1). The memorise-and-repeat production task does not require the speaker to generate semantic and syntactic content, but it does require the speaker to generate phonological and motor programmes. Importantly, this task increases memory demand compared to previous work, and our primary hypothesis was that covert disfluencies (i.e., unfilled pauses and repetitions) are due to (slow) memory retrieval, and if so, they should be related to individual differences in working memory. If not, then they should be related to individual differences in verbal intelligence, as reported by Engelhardt et al. Thus, our main interest was the relationship between covert disfluency and memory ability. However, we also coded two other types of disfluency: filled pauses (i.e., uh’s, um’s, and er’s) and repairs.
Table 1. Example sentences.
Method
Participants
Participants were 78 adults with a mean age of 20.0 years (standard deviation [SD] = 2.36, range: 18.2-34.90). In total, 69 were female, and nine were male. Approximately half were compensated £20 for taking part in the study, and half received participation credits from the undergraduate psychology pool at Northumbria University.⁵
Standardised measures
Intelligence and working memory
Participants completed seven subtests from the Wechsler Adult Intelligence Scale, 3rd edition (Wechsler, 1997a). The verbal intelligence subtests were comprehension, information, similarities, and vocabulary, and the working memory subtests were arithmetic, backward digit span, and digit span. Vocabulary requires participants to provide the definitions of words and measures the degree to which one has learned and is able to express meanings verbally. Similarities requires participants to describe how two words are similar, with the more difficult items typically describing the opposite ends of a “unifying continuum.” Thus, it measures abstract verbal reasoning. Comprehension requires participants to answer questions based on understanding of general world knowledge and social situations. The test is designed to assess verbal reasoning, verbal comprehension, and the use of past experience to demonstrate knowledge and judgements. Information requires the participant to address a broad range of general knowledge topics. It assesses the ability to retrieve general information. Arithmetic requires the participant to mentally solve arithmetic (story type) problems. It assesses a variety of abilities associated with numerical reasoning, short-term memory, and mental manipulation. It also requires concentration, attention, and avoidance of distraction. Digit span requires participants to encode and then recall a sequence of numbers in the same order as given by the examiner. The backward digit span requires participants to recall the items in the reverse order. These tasks assess working memory, manipulation, attention, and encoding.
Wisconsin card sorting test
A computerised version of the Wisconsin Card Sorting Test was administered. The task requires participants to match a card to one of four other cards based on one of four attributes (shape, colour, quantity, or design). There were four kinds of shapes (i.e., stars, crosses, triangles, and circles) and there were four colours (i.e., red, yellow, blue, and green). Participants were given feedback after every decision. After five correct decisions, the correct match attribute changed. This was repeated through nine cycles. The dependent measure was number of perseveration errors, that is, the number of incorrect decisions based on the previous match attribute (Anderson, Damasio, Jones, & Tranel, 1991; Heaton, Chelune, Talley, Kay, & Curtiss, 1993).
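The dependent measure described above can be illustrated with a minimal Python sketch. It assumes that, for every trial, the attribute the participant sorted by is known, along with the currently correct attribute and the attribute that was correct before the most recent rule change. The `Trial` record and its field names are hypothetical and are not taken from the task software.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    """One sorting decision (hypothetical record layout)."""
    chosen: str                   # attribute the participant matched on (assumed known)
    current_rule: str             # attribute that was correct on this trial
    previous_rule: Optional[str]  # rule before the last switch; None in the first cycle


def count_perseveration_errors(trials: Sequence[Trial]) -> int:
    """Count incorrect decisions that follow the previous match attribute."""
    return sum(
        1
        for t in trials
        if t.chosen != t.current_rule and t.chosen == t.previous_rule
    )


# Made-up example: the rule has just switched from colour to shape.
example = [
    Trial("colour", "colour", None),       # correct, first cycle
    Trial("colour", "shape", "colour"),    # perseveration error
    Trial("quantity", "shape", "colour"),  # error, but not perseverative
    Trial("shape", "shape", "colour"),     # correct
]
n_errors = count_perseveration_errors(example)
```

Under this definition, only the second trial in the made-up example counts as a perseveration error; errors that follow some other attribute are excluded.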
Stroop task
The Stroop task assesses the ability to monitor response conflict and suppress a competing response in order to successfully execute the task requirements (i.e., inhibition). Participants completed a computerised version of the Stroop task (Golden, 1978; Stroop, 1935), which contained 160 trials in total. There were six conditions (i.e., 20 colour-congruent trials, 20 colour-incongruent trials, 40 patch trials, 40 written trials, 20 word congruent trials, and 20 word incongruent trials). Responses were recorded automatically through the use of a voice trigger, and to compute an interference score, we subtracted the mean reaction time (ms) of the colour-congruent trials from the mean of the colour-incongruent trials (Lansbergen, Kenemans, & van Engeland, 2007).
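The interference score described above is a simple difference of condition means. The following Python sketch shows the computation; the function name and the reaction times are illustrative only and are not data from the study.

```python
from statistics import mean
from typing import Sequence


def stroop_interference(
    congruent_rts_ms: Sequence[float],
    incongruent_rts_ms: Sequence[float],
) -> float:
    """Mean incongruent RT minus mean congruent RT, in milliseconds."""
    return mean(incongruent_rts_ms) - mean(congruent_rts_ms)


# Made-up reaction times (ms) for one hypothetical participant.
congruent = [600.0, 620.0, 640.0]
incongruent = [700.0, 720.0, 740.0]
score = stroop_interference(congruent, incongruent)
```

A larger positive score indicates greater slowing on colour-incongruent trials relative to colour-congruent trials.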
Sentence production
Materials
The 40 experimental items were based on sentences from Christianson et al. (2001). There were 20 subordinate–main sentences, which were all ambiguous and contained optionally transitive subordinate clause verbs. Half of these had a modifier between the main clause subject and main clause verb (see Supplementary Materials, Section A, for a list of critical items). The mean length of these sentences was 10.4 words (46.35 characters) and they ranged in length from 8 to 13 words. There were 20 main–subordinate sentences, and the subordinate clauses for these were all transitive. Of the 20, five contained reflexive subordinate clause verbs, and half had a modifier between the main clause subject and main clause verb. The mean length of these sentences was 11.15 words (49.4 characters) and they ranged from 9 to 14 words. There were no length differences between the subordinate–main and main–subordinate sentences (both ps > .16). None contained commas separating the subordinate and main clauses, and there were 421 words in total in the experimental items.
Procedure
The task was based on the procedure from Ferreira (1991). Participants were instructed that they would see a sentence that they had to memorise and repeat back, and that it was important that they spoke the sentence exactly as it was written and in a natural manner. Participants pressed the space bar and a fixation cross appeared for 1 s. The fixation cross was followed by the sentence, which was presented in the centre of the computer screen. After participants had memorised the sentence, they pressed the space bar, and a question appeared on the screen (i.e., “What happened?”). Participants spoke the sentence out loud, and when they were finished speaking, they pressed the space bar to start the next trial. There were three practice trials and 40 experimental items. The order of trials was randomly determined for each participant. If participants forgot the sentence on a particular trial, they could press the “R” key to go back and re-view the sentence. Partial recordings were not saved. Participants spoke into a condenser microphone in a sound-dampened testing cubicle, and the experiment was programmed with E-Prime experimental software. The sentences were automatically recorded and saved as .wav files.
Utterance coding
Memory errors
Any errors in the utterance affecting content words were counted as memory errors. These included omissions of content words, incorrect inclusions, and incorrect substitutions (e.g., archivist vs activist, large vs big, and floor vs ground). Minor differences (e.g., eating vs eatin, book vs books) and differences involving function words (e.g., the vs a, have vs has) were not counted as memory errors.
Disfluency
Three main types of disfluency were examined: unfilled pauses, repetitions, and repairs.⁶ Repetitions refer to unintended repeats of a word or string of words with no functional benefit. Repairs occur when a speaker suspends articulation, and then starts over with a new word or phrase. We assessed the lengths of all unfilled pauses that were 250 ms or greater. We viewed the threshold for an unfilled pause as a somewhat subjective decision because often researchers will utilise a higher threshold (e.g., 1–3 s), so as to exclude prosodic pausing (Kormos & Denes, 2004; Lake, Humphreys, & Cardy, 2011). However, a recent study by De Jong and Bosker (2013) that investigated perceptions of fluency in L2 learners and accounted for speech rate argued that 250 ms is the best threshold for unfilled pauses, and this is consistent with the original work of Goldman-Eisler (1968) (see also, Harley, 2013; Harley & MacAndrew, 2001; Redford, 2013). With a 250-ms threshold, approximately 16% of sentences contained at least one unfilled pause. The data set was coded twice, once by the second author and once by a trained research assistant. The first author compared the two data files and resolved discrepancies. In cases in which the length of an unfilled pause differed by more than 50 ms, it was reassessed by the first author.⁷ For the remainder, we averaged the two durations. The corpus contained 3,120 sentences (approximately 33,000 words in total), and the proportion of sentences with a particular type of disfluency is shown in Figure 2.

Figure 2. Proportion of sentences containing a disfluency broken down by type and proportion of sentences with memory error. Error bars show the standard error of the mean.
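The pause-coding rules described in the Disfluency section (a 250-ms threshold, reassessment when the two coders differ by more than 50 ms, and averaging otherwise) can be sketched in Python as follows. The function and constant names are hypothetical, and the sketch covers only the decision rule, not the acoustic measurement itself.

```python
from typing import Optional, Tuple

PAUSE_THRESHOLD_MS = 250.0  # minimum duration counted as an unfilled pause
MAX_CODER_DIFF_MS = 50.0    # larger disagreements are sent for reassessment


def is_unfilled_pause(duration_ms: float) -> bool:
    """A silence counts as an unfilled pause at 250 ms or greater."""
    return duration_ms >= PAUSE_THRESHOLD_MS


def reconcile_pause(
    coder_a_ms: float, coder_b_ms: float
) -> Tuple[Optional[float], bool]:
    """Return (duration, needs_reassessment) for one doubly coded pause.

    If the coders differ by more than 50 ms, no duration is returned and
    the pause is flagged; otherwise the two durations are averaged.
    """
    if abs(coder_a_ms - coder_b_ms) > MAX_CODER_DIFF_MS:
        return None, True
    return (coder_a_ms + coder_b_ms) / 2.0, False


# Made-up durations (ms) from the two coders.
agreed = reconcile_pause(300.0, 340.0)
flagged = reconcile_pause(300.0, 400.0)
```

In this sketch, a difference of exactly 50 ms is averaged rather than flagged, following the "more than 50 ms" wording.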
Procedure
Participants were recruited via fliers posted on university grounds and by advertisement on the psychology undergraduate participation pool. Upon entering the lab, participants provided written informed consent and basic demographic information. They then completed each of the tests in the battery (verbal intelligence, working memory, Wisconsin Card Sort task, Stroop task, and the sentence production task). Tasks were completed in different rooms and in different testing cubicles, and participants were given obligatory breaks between tasks to avoid fatigue. Each participant completed the tasks in the same order. (Participants completed several additional tasks, including the autism spectrum quotient [AQ] and a second sentence production task.) The entire testing session lasted approximately 2 hr.
Reliability
The standardised measures used in this study are all well-established tests with widely accepted reliability. The Wechsler intelligence tests (and the subscales) typically have reported reliabilities in the .85 to .95 range (Friedman et al., 2007; Friedman et al., 2006; Wechsler, 1997a, 1997b). The Stroop task and the Stop task have reported reliabilities in the .80 to .90 range (Friedman et al., 2007; Friedman & Miyake, 2004), and the Trails task and the Wisconsin Card Sort task typically have lower, borderline-acceptable reliability of approximately .70 (for extended discussions of reliability in standardised executive function tasks, see Denckla, 1996; Friedman & Miyake, 2004; Rabbitt, 1997). For the non-standardised measure (i.e., the sentence production task), we computed split-half reliabilities and corrected the coefficients with the Spearman–Brown prophecy formula (Brown, 1910; Spearman, 1910). The mean reliability was α = .83 for unfilled pauses, α = .61 for repetitions, α = .37 for repairs, and α = .82 for memory errors. The only type of disfluency that fell outside of traditionally accepted levels was repairs (Nunnally, 1978).
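As an illustration of the split-half procedure, the following sketch correlates participants' scores on two halves of the trials and applies the Spearman–Brown correction, r_SB = 2r / (1 + r). The odd/even split and all function names are assumptions made for the example; the text does not specify how the halves were formed.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to the full test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(trials_by_participant):
    """Each inner list holds one participant's trial scores (e.g., 0/1 for
    the presence of a disfluency). Assumed split: odd versus even trials."""
    first = [sum(trials[0::2]) for trials in trials_by_participant]
    second = [sum(trials[1::2]) for trials in trials_by_participant]
    return spearman_brown(pearson(first, second))
```

A half-test correlation of .50, for instance, corresponds to a corrected coefficient of about .67.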
Analysis procedures
Structural equation models were created and run using AMOS. In the analyses, we report several fit indices for our models, which is recommended practice (Anderson & Gerbing, 1988; Gonzalez & Griffin, 2001; Hu & Bentler, 1995, 1998, 1999; Kane et al., 2004; Kline, 1998; Miyake, Friedman, Rettinger, Shah, & Hegarty, 2001). The chi-square statistic tests whether the observed covariance matrix differs significantly from the reproduced covariance matrix. With chi-square tests, a non-significant value is desirable (Satorra & Bentler, 2001). We also report the comparative fit index (CFI), which reflects improvement of the model fit relative to a baseline model in which all covariances are zero. This index also reflects the proportion of the observed covariance matrix explained by the model, and so, it reflects how well the model fits the data. A CFI of .90 or above is considered acceptable (Stevens, 2002). Finally, we report the root mean square error of approximation (RMSEA). Here, values less than .05 indicate good fit (Kline, 1998). In cases where we wanted to compare (nested) models, we utilised a chi-square difference test in order to determine whether one model fits the data significantly better than another.
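To make the RMSEA concrete, the sketch below uses one common single-group definition, RMSEA = sqrt(max(χ2 − df, 0) / (df × (N − 1))). The function name is hypothetical, and the formula is assumed rather than quoted from the AMOS output. With N = 78, it reproduces the RMSEA values reported for the one- and two-factor measurement models in the Results.

```python
import math


def rmsea(chi_square, df, n_participants):
    """Root mean square error of approximation for a single-group model.

    Assumed formula: sqrt(max(chi_square - df, 0) / (df * (N - 1))).
    """
    excess = max(chi_square - df, 0.0)
    return math.sqrt(excess / (df * (n_participants - 1)))


# Fit statistics reported for the measurement models (N = 78)
two_factor = rmsea(9.812, 8, 78)    # reported RMSEA = .054
one_factor = rmsea(23.674, 9, 78)   # reported RMSEA = .146
```

Because the numerator is floored at zero, any model whose chi-square does not exceed its degrees of freedom yields an RMSEA of exactly zero.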
Results
Data screening and preparation
Data points greater than 3.0 SDs from the mean for each variable in the data set were defined as outliers. Outliers were replaced with the mean of that variable (McCartney, Burchinal, & Bub, 2006). This avoids listwise deletion and the corresponding reduction in power (Shafer & Graham, 2002). There were three outliers in the data set, which were assessed via standardised values (vocabulary had one outlier, −3.03; backward digit span had one outlier, −3.13; and digit span had one outlier, −4.89). Finally, because multivariate tests are sensitive to deviations from normality, we applied transformations (i.e., square root, logarithm, or inverse) to the skewed variables in the data set (Kline, 1998).
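The outlier rule can be sketched as follows. This is an illustration under stated assumptions, not the screening script used in the study: the function name is hypothetical, the sample standard deviation is assumed, and the mean and SD are computed with the outlying value still included.

```python
import statistics


def replace_outliers_with_mean(values, cutoff=3.0):
    """Replace values more than `cutoff` SDs from the mean with the mean.

    Assumptions: sample SD (n - 1 denominator); mean and SD are computed
    on the full variable, including any outliers.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return list(values)  # no variability, so nothing can be an outlier
    return [mean if abs(v - mean) / sd > cutoff else v for v in values]


# Hypothetical scores: 19 typical values and one extreme value
scores = [10] * 19 + [100]
cleaned = replace_outliers_with_mean(scores)
```

Replacing rather than deleting the value keeps the participant in the analysis, which is the motivation given in the text for avoiding listwise deletion.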
The descriptive statistics and bivariate correlations between variables are presented in Tables 2 and 3, respectively. The first step in the analysis was to establish whether a two-factor model fit the verbal intelligence and working memory measures better than a one-factor model. Results showed that a two-factor model was generally a good fit, χ2(8) = 9.812, p = .28, CFI = .980, RMSEA = .054. A one-factor model, in contrast, was not a good fit, χ2(9) = 23.674, p = .005, CFI = .834, RMSEA = .146. A chi-square difference test showed that model fit was significantly better with the two-factor model than with the one-factor model, Δχ2(1) = 13.86, p < .001. (Factor loadings are shown in Figure 1.) Thus, there was statistical evidence supporting the dissociation of the two main hypothesised constructs (see also Ackerman, Beier, & Boyle, 2005; Martin, 2001).
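The chi-square difference test for nested models can be illustrated with the values reported above. The sketch is limited to a difference of one degree of freedom, for which the chi-square tail probability equals erfc(sqrt(x / 2)); the function names are hypothetical.

```python
import math


def chi_square_difference(chi2_restricted, df_restricted, chi2_full, df_full):
    """Difference in chi-square and in degrees of freedom for nested models."""
    return chi2_restricted - chi2_full, df_restricted - df_full


def p_value_df1(chi_square):
    """Upper-tail probability of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# One-factor (restricted) versus two-factor (full) measurement model
delta, delta_df = chi_square_difference(23.674, 9, 9.812, 8)
p = p_value_df1(delta)  # delta is about 13.86 on 1 df, so p is below .001
```

The same two functions apply to the hierarchical tests reported later, where a single regression weight is fixed to zero and the resulting loss of fit is evaluated on one degree of freedom.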
Table 2. Descriptive statistics for the intelligence subtests, executive function measures, and the sentence production task.
SD: standard deviation.
Inverse transformation.
Square root transformation.
Table 3. Bivariate correlations between verbal intelligence, executive function, and disfluencies.
Gender coded 0 = male and 1 = female.
*p < .05; **p < .01; #p < .07.
Memory errors
The second step in the data analysis focused on memory errors, that is, failures to correctly repeat the sentence. The bivariate correlations (in Table 3) revealed that memory errors were significantly correlated with the two working memory span tasks. We tested the proportion of sentences with a memory error in a two-factor (nested) path model, in which there were paths from both verbal intelligence and working memory to memory errors (see Figure 3). Results from the hierarchical tests showed that model fit was significantly worse when the regression weight from working memory to memory errors was set to zero (see Table 4). Model fits were good when memory errors loaded on working memory and were poor when memory errors loaded onto verbal intelligence. This suggests that the memorise-and-repeat paradigm was operating as intended: successful task performance was related to working memory, and in particular, the span tasks.

Figure 3. Nested path model.
Table 4. Hierarchical structural equation analyses examining memory errors and disfluencies against verbal intelligence and memory ability.
CFI: comparative fit index; RMSEA: root mean square error of approximation; WM: Working Memory.
p < .10; *p < .05; ***p < .001.
Note. Shaded results show significant and marginal results.
Disfluency
We tested unfilled pauses and repetitions in hierarchical path models. We also tested repairs even though our main hypotheses focused on covert disfluency. The results are shown in Table 4. For repetitions, we found that model fit was significantly worse when the regression weight from verbal intelligence to repetitions was set to zero. Model fits were generally good when repetitions loaded onto verbal intelligence (cf. RMSEA) and were poor when the verbal intelligence pathway was set to zero. This indicates that the tendency to produce repetitions is related to more general verbal abilities and not to individual differences in working memory. For unfilled pauses and repairs, there was a marginal decrease in model fit when the regression weight from working memory was set to zero (see Table 4), and in general, model fits were quite good in both sets of analyses. The marginality of these two findings is obviously unsatisfying from an empirical standpoint, especially given the size of the data set. However, there are a couple of points that should be kept in mind. The first is that unfilled pauses were significantly correlated with memory errors (r = .63), which was the highest correlation in the data set. Repairs were also significantly correlated with memory errors, and repairs and unfilled pauses were significantly correlated with each other. Also, recall that memory errors clearly loaded onto the span tasks. Thus, the results indicate that there is shared variance between the working memory (span) tests, memory errors, and two types of disfluency (i.e., unfilled pauses and repairs). Repetitions, in contrast, were clearly related to verbal intelligence.
As a final analysis of the proportion of sentences with a disfluency, we conducted a post hoc model re-specification in which we removed arithmetic from the model and substituted memory errors. The factor loading for arithmetic was the lowest of the six Wechsler tests, and it did not correlate with the two span tasks, whereas memory errors were significantly correlated with both. Results of this follow-up analysis were similar to the results presented in Table 4. There was a significant decrease in model fit, Δχ2(1) = 5.01, p < .05, when the regression weight from verbal intelligence to repetitions was set to zero. However, the results for unfilled pauses and repairs were both significant: Model fit significantly decreased when the regression weight from working memory was set to zero, unfilled pauses Δχ2(1) = 27.50, p < .00001 and repairs Δχ2(1) = 4.46, p < .05. We acknowledge the problems associated with model re-fitting post data collection. However, in this case, a minor increase in model fit led to a much clearer picture, especially in terms of statistical significance, of the relationship between individual differences in working memory and both unfilled pauses and repairs. In particular, it made a substantial impact on unfilled pauses. The factor loading from memory errors to the latent variable representing working memory was −.52, whereas the factor loading from arithmetic to working memory was .34. (The other factor loadings were virtually unchanged when arithmetic was replaced with memory errors.) So across both sets of hierarchical analyses, we see that repetitions are related to verbal intelligence and not to working memory, and unfilled pauses and repairs are more related to working memory than to verbal intelligence.
Unfilled pause length
Recall that we used a relatively low threshold for unfilled pauses, and results, at the worst, showed a marginal relationship between unfilled pauses and memory ability. However, the analyses thus far have only considered the presence/absence of an unfilled pause. As part of the data coding procedure, we also assessed the length of all unfilled pauses in the data set. If there were multiple pauses greater than 250 ms in length, then the length of each was noted. No sentence contained more than four unfilled pauses, and only two had four unfilled pauses. A histogram showing the distribution of pause length is shown in Figure 4. As can be seen, the distribution was positively skewed with the vast majority of unfilled pauses being shorter than 1 s, and the overall mean was 887 ms. To investigate the relationship between the length of unfilled pauses and the individual difference variables, we ran correlations between the individual difference variables and the mean length of first pauses, the mean length of second pauses, and the mean length of third pauses (see Table 5). Results from this follow-up analysis revealed some interesting differences compared to the previous analyses. The previous analyses showed that the presence/absence of unfilled pauses was marginally-to-significantly related to working memory. However, the mean length of those unfilled pauses seems to be most closely related to the two executive function measures (Stroop task and perseveration errors). The Stroop task is most commonly used to assess inhibition, and perseveration errors are used to assess set shifting. However, there were also two marginally significant correlations between two of the verbal intelligence subtests and the length of first pauses. The relationship between pause length and the two executive function measures is interesting because the Stroop task and perseveration errors were not correlated with any of the other variables in the data set.
We return to this issue in the General Discussion.

Figure 4. Histogram showing the distribution of the length of all unfilled pauses.
Table 5. Bivariate correlations between individual differences measures and unfilled pause length.
Seventy-three participants produced at least one unfilled pause.
Thirty-eight participants produced at least two unfilled pauses on one trial.
Fourteen participants produced at least three unfilled pauses on one trial.
p < .08, *p < .05.
Note. Shaded regions show marginal and significant correlations.
Discussion
The focus of this article was the role of individual differences in the production of covert disfluencies, which occur when there is no overt error or observable correction but yet a clear disruption of speech output. As mentioned in the Introduction, covert disfluencies are empirically difficult to study because both the problem and correction of the problem remain hidden. To get around this difficulty, we selected a task that had a relatively narrow cognitive-demand profile, and specifically, we used a memorise-and-repeat sentence-production task (Ferreira, 1991). In previous work, we reported that repetitions (Engelhardt et al., 2010) and unfilled pauses (Engelhardt et al., 2013) were related to individual differences in (verbal) intelligence. Much of the previous literature suggested that covert disfluencies reflect any one of a number of problems in the course of speaking, but what we were interested in was the relationship between memory ability and covert disfluency. That is, do memory retrieval difficulties (or memory retrieval slowdowns) lead to higher rates of disfluency in sentence production? If they do, then it suggests that individual differences in memory can affect the fluency of speech outputs. It does not mean, however, that it will in every situation or when memory demands are lower. The obvious caveat, which applies to all laboratory-based production studies, is that everyday speaking situations are much different as various cognitive demands (e.g., attention, memory, turn taking, and comprehension) fluctuate in real time to impact ongoing speech planning and output. However, our main interest in this study was the relationship between working memory and covert disfluency. We wanted to isolate that relationship, and in particular, in the context of previous work (e.g., Engelhardt et al., 2013), contrast individual differences in memory ability with individual differences in verbal intelligence. 
Thus, we felt that a controlled speaking situation was necessary given the specificity of the hypotheses, and thus, all participants in our study produced the same words and the same syntactic structures, which ensured that production demands were equal across participants. In general, we observed that the relationships between disfluency production and the individual difference variables were negative (i.e., lower ability individuals tended to produce more disfluencies). In the remainder of the discussion, we cover the proportion of sentences with a particular type of disfluency, the effect of executive function on the length of hesitations, the strengths and limitations, and end with conclusions.
Types of disfluency
To summarise the main findings, we found that repetitions were significantly related to verbal intelligence, and not to working memory. Unfilled pauses and repairs, in contrast, were (at worst) marginally related to working memory. A follow-up analysis involving a (post hoc) model tweak resulted in significant effects for both unfilled pauses and repairs on working memory. Setting the post hoc significance issues aside for the moment, repairs showed a nearly identical regression weight as unfilled pauses on working memory, and both repairs and repetitions showed no relationship with the “other” latent variable. That is, repetitions were not related to working memory, and repairs were not related to verbal intelligence (both regression weights <.072). Unfilled pauses were different, as there was some systematic variance between unfilled pauses and verbal intelligence. In the Introduction, we made an argument about the importance of being able to account for shared versus unique variance. For researchers working on the inter-relations of executive functioning and intelligence (e.g., Miyake & Friedman, 2012), this may seem obvious. However, as individual difference studies of language production (and comprehension) have begun to appear, it is not always clear that language researchers fully appreciate the issues associated with shared variance. For example, Vuong and Martin (2013) had participants complete a sentence comprehension task and a Stroop task, and based on a correlation between the two, argued that executive control is important in language comprehension. However, from one correlation, it is impossible to determine whether the variance is shared or unique. The Stroop variance may well be shared variance with working memory, and thus, to make claims about executive function-x versus executive function-y, it is advisable to simultaneously model both. Our data have an example of how bivariate correlations can be somewhat misleading. 
From Table 3, it is clear that repetitions are more related to intelligence than to working memory, and repairs are more related to working memory. However, unfilled pauses showed slightly higher correlations with intelligence, but the hierarchical tests revealed slightly more systematic (shared) variance with working memory. The main reason this oddity can happen is because our model contained both working memory and verbal intelligence, and it also estimated the correlation between the two. Thus, by modelling working memory and verbal intelligence, we were able to gain what we think is a clearer picture compared to studies that consider only one ability or those that examine individual difference variables in separate analyses (i.e., through simple correlations).
There are several points worth considering, before we move onto the length of unfilled pauses. Recall that consistent with our previous research, repetitions were related to verbal intelligence, and the regression path for the full model was −.29, which according to Cohen (1988) represents a medium effect size. An obvious issue that arises from this finding is what exactly is verbal intelligence and why might it be important to avoid repeating oneself? Answers to these questions usually begin by acknowledging that verbal intelligence is primarily based on crystallised knowledge, or knowledge acquired through prior experience (e.g., Bates & Stough, 1997; Carroll, 1993; Deary, 2001; Spearman, 1927). Explanations (e.g., Luke, Henderson, & Ferreira, 2015) tend to then highlight the quality of experience leading to quality of lexical representations, including a greater number of lemmas, a greater specificity of individual lemmas, and faster access. However, performance on individual verbal intelligence subtests might be slightly distinct.
The second point concerns a study by Lake et al. (2011), who assessed individuals with high-functioning forms of ASD to investigate speaker-oriented versus listener-oriented disfluency (see also, Belardi & Williams, 2009; Shriberg et al., 2001; Suh et al., 2014; Tager-Flusberg, Paul, & Lord, 2005; Thurber & Tager-Flusberg, 1993). Lake et al. showed a dissociation in which individuals with High-Functioning Autism (HFA) produced fewer filled pauses and repairs, and more unfilled pauses and repetitions when compared to typically developing controls (cf. Engelhardt, Alfridijanta, McMullon, & Corley, 2017). They argued that the former are listener-oriented and the latter are speaker-oriented. Before delving into the dissociation between the different types of disfluency, it is important to note that Lake et al. examined production in dialogue and employed different methods for assessing disfluencies; most notably, their threshold for unfilled pauses was much longer. If we consider first repetitions and repairs, which were assessed similarly in the two studies, Lake et al. argued that they are due to different underlying causes, repetitions to speaker-internal factors and repairs to listener-oriented factors. Our results showed that distinct individual difference variables were related to repetitions and repairs (verbal intelligence and working memory, respectively); however, the direction of the effects was the same (i.e., lower functioning individuals produced more disfluency). If speakers produce repairs for the benefit of the listener, then one might expect the direction of the effect to be reversed: Higher ability individuals should be more attuned and accommodating to listeners’ needs compared to lower ability individuals. Again, it is important to bear in mind that there may be task differences that have a role in these discrepancies, but at the same time, our task is presumably the easier of the two. The Lake et al. task was much more spontaneous and involved interacting with an experimenter, whereas our task was essentially scripted monologue. There are two other bits of evidence that support a negative relationship between cognitive abilities and repair disfluencies. First, Engelhardt et al. (2013) showed a negative relationship between repairs and inhibitory control. Second, in this study, most of the participants also completed the AQ questionnaire, which is widely used to assess autism spectrum symptoms in typically developing individuals. However, results show no correlation between repairs and total AQ scores (r = −.001, p = .99, N = 75), and all six of the subscales were likewise non-significant (all rs < .15, ps > .20). Thus, the Lake et al. study is the only one we are aware of to suggest that repairs are listener-oriented.
Finally, we think it is important to highlight what these results mean moving forward in terms of both practical and theoretical significance. What we have shown in this study is that there are some reasonably strong associations between individual difference variables and the production of different types of disfluency. At this point, we cannot say for certain whether the two are causally related or merely correlated. However, given the clinical work looking at ADHD and ASD, and the increased rates of disfluency in both disorders, we strongly suspect that there are causal links between individual differences in executive functioning and intelligence and the ability to produce fluent language outputs. Similar, causally based conclusions are routinely made in the executive function literature (e.g., Miyake & Friedman, 2012), which we reviewed in the Introduction. Future work is needed to conclusively establish causal relationships, as well as to provide greater specificity regarding which stages of language production are affected by or related to individual differences.
Unfilled pause length
As part of the analysis protocol, we also assessed the length of all silent pauses longer than 250 ms. Results revealed that the length of unfilled pauses correlated with the Stroop task and perseveration errors, whereas the proportion of sentences with a disfluency did not. We found the correlations with length quite surprising. We chose a task that relied on memory with the idea that pausing and repeating would be associated with individual differences in memory ability, and if so, then we also assumed that the length of pauses should also be related to memory (or perhaps, speed of memory retrieval). Given that the executive function measures correlated with pause length, and given the fact that the executive function measures did not correlate with proportion of sentences with a disfluency (Stroop task, r = −.09 and perseveration errors, r = .02), we speculate that (a) pausing behaviour is moderately associated with memory (as one would expect with a high-demand memory task) and (b) that how the production system gets “back on track” following a disruption is more related to individual differences in cognitive control. Consistent with our speculations, a recent study by Hussey, Ward, Christianson, and Kramer (2015) found increased performance on a sentence comprehension task following stimulation of the left lateral prefrontal cortex, an area strongly suspected of being involved in executive control in the language domain. Unfortunately, in this study, we did not have multiple measures of inhibition and set shifting, and so, we are not able to model these executive functions in the same way we did with working memory and verbal intelligence. Thus, we are not in a position to comment on how much of the variance is shared and how much is unique.
Strengths and limitations
We think the main strengths of this study are (a) the relatively large sample and (b) that we modelled two individual differences variables and tested them within a single model. The size of the sample (N = 78) and the size of the test battery are unique for language production research, which tends to require more time-consuming analyses. There are, however, several limitations. The first one is that the task we used required rote memorisation of a sentence, which is very much unlike everyday speaking situations. However, we thought that a controlled speaking task was necessary given our goals, and thus, we employed a task in which all participants produced the same words and the same syntactic structures, which ensures equal task demands across the entire sample. A second limitation is that we are not in a position to model inhibition or set shifting. This is important given the correlations with the length of unfilled pauses. However, when designing individual difference studies, there are usually limitations on the number of measures that can be administered, and thus, some sacrifices must be made. Nevertheless, the correlations between Stroop performance and perseveration errors provide an obvious future direction for looking at the role of cognitive control in the length of speech pauses. Another obvious avenue for future research is to consider the role of attention, which has a tendency to be the odd man out in investigations of executive function (e.g., Jou & Harris, 1992; Miyake et al., 2000).
Conclusion
In this study, we found that repetitions were significantly related to individual differences in verbal intelligence, which is consistent with Engelhardt et al. (2010). Unfilled pauses and repairs, in contrast, were marginally more related to the working memory measures compared to verbal intelligence. That is, model fits were marginally decreased when the pathway from disfluency to working memory was eliminated. To our knowledge, no study has previously investigated the length of pauses in speech and individual differences. The extent to which language production is influenced by individual differences has only recently become the topic of empirical studies. In our view, individual difference studies have the potential to shed new light and to expand theoretical debates in language production. As Underwood (1975) eloquently argued more than 40 years ago, individual differences studies are one of the key drivers of theory construction (Levelt, 1989) and modelling efforts (Dell, Burger, & Svec, 1997; Dell, Chang, & Griffin, 1999; Samuelson, Jenkins, & Spencer, 2015). However, there are a range of important research questions relating to individual differences. For example, do individual differences in executive function (e.g., inhibition) affect particular stages or processes of sentence production (e.g., lemma selection) (Berg & Schade, 1992; Wheeldon & Levelt, 1995)? With regard to monitoring, does the production system rely on a domain-general error monitoring mechanism (Holroyd & Coles, 2002) or is the language monitoring device domain-specific (Ganushchak & Schiller, 2008; Oomen & Postma, 2001, 2002)? We think these results shed new light on the factors influencing the production of covert disfluency in sentence production, and our hope is that this study sparks future research that considers how individual differences in executive function and intelligence impact the fluency of speech outputs.
Supplemental Material
Supplemental material, QJE-STD_17-285.R2-Supplementary_Materials for Individual differences in the production of disfluency: A latent variable analysis of memory ability and verbal intelligence by Paul E Engelhardt, Mhairi EG McMullon and Martin Corley in Quarterly Journal of Experimental Psychology
Footnotes
Acknowledgements
We thank Sean Veld for assisting with the materials, experiment programming, and data collection, and Yasmine Haggar and Peter Kirk for help coding the data set.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This research was funded by a British Academy Small Grant (SG112155) awarded to Paul Engelhardt (PI) and Martin Corley (Co-I).