Gesture-Speech Interaction Beyond Planning: Evidence from Perturbations During Iconic Gesture and Speech Execution

Abstract

Gesture and speech are tightly coordinated during communication, yet the mechanisms underlying this coordination remain debated. The interactive view proposes that gesture and speech can influence each other not only during planning but also during execution, whereas the ballistic view holds that interaction is limited to pre-execution stages. To test these competing theories, four experiments investigated whether disrupting one modality affects the execution of the other. Participants named and gesturally depicted geometric shapes and motion events while gesture or speech was disrupted using virtual reality, motion tracking, delayed visual or auditory feedback, and stimulus changes. In all experiments, disrupting one modality consistently prolonged the execution time of the other modality. Moreover, the magnitude of execution prolongation in one modality was related to that in the other. These results show that gesture and speech remain continuously and bidirectionally connected during execution, providing strong support for the interactive view.

Keywords

gesture-speech interaction; interactive view; iconic gesture; virtual reality; motion tracking; gesture production model

Introduction

Human communication is not limited to spoken language. Speakers often use hand gestures that are closely integrated with speech, forming a unified communicative system (Kendon, 2004; McNeill, 1992, 2005, 2012; McNeill & Duncan, 2000). Research shows that gesture and speech are coordinated at both linguistic and temporal levels (see Wagner et al., 2014 for a review). For example, when describing rotational motion, speakers frequently accompany their words with circular hand movements that illustrate the same event (Chu & Kita, 2008, 2016; Kendon, 1980; McNeill, 1992). Experimental studies further indicate that gesture production is influenced by linguistic properties (Kita & Özyürek, 2003) and by the timing of related words (Leonard & Cummins, 2011; Loehr, 2007; McClave, 1998; Tuite, 1993). This close coordination suggests interaction between gesture and speech systems, although the specific stages of production at which this interaction occurs remain debated.

Gesture and speech production are often described as involving multiple processing stages. Gesture production typically includes a planning stage in which gesture content, form, and movement timing are determined, followed by an execution stage in which the movement is performed (De Ruiter, 1998; Kita & Özyürek, 2003; Krauss et al., 2000). Speech production is generally described as involving a conceptualizing stage for planning speech content, a formulating stage for retrieving semantic, syntactic, and phonological representations, and an articulation stage for producing speech (e.g., Caramazza, 1997; Dell et al., 1997; Levelt, 1989; Levelt et al., 1999; Rapp & Goldrick, 2000). Although these stages are well defined within each modality, it remains unclear how and when the gesture and speech systems interact during their planning and execution phases.

Two main views have been proposed regarding when the gesture and speech systems can interact during production. According to the ballistic view, interaction between the two systems is limited to their planning stages; once gesture or speech has begun, the two systems operate independently and can no longer influence each other (Levelt et al., 1985). In contrast, the interactive view proposes that gesture and speech can exchange information not only during planning but also during execution (Chu & Hagoort, 2014). The two views, therefore, agree that gesture planning can interact with speech conceptualization or formulation, but they diverge on whether interaction remains possible after execution has started.

Levelt et al. (1985) conducted the first experimental test of the interaction between gesture and speech by investigating how disrupting gesture execution affects speech production. Participants were asked to point to and name a briefly illuminated light using expressions such as “this light” or “that light.” Gesture execution was disrupted by attaching a 1,600 g weight to the wrist during either the early or middle phase of the pointing movement, and the effect on speech timing was measured. Speech onset was delayed only when the perturbation occurred at the early phase of gesture execution, but not when it occurred in the middle phase. Based on these findings, the authors concluded that once gesture execution begins, gesture and speech operate in an almost ballistic manner.

In Levelt et al. (1985), however, gesture execution was disrupted by adding a load to the wrist, which interfered with movement through kinaesthetic feedback. When the load was applied, participants needed to recalibrate movement parameters and adjust the force to overcome the unexpected resistance. This recalibration took time, especially because visual feedback generally plays a more dominant role in guiding hand movements compared with kinaesthetic feedback (Welch & Warren, 1986). As a result, when gesture execution was disrupted at the middle phase, the motor system may not have had sufficient time to influence the speech system before speech onset.

In contrast, when gesture execution was visually disrupted in Chu and Hagoort (2014), speech onset was delayed even if the disruption occurred at the late phase of gesture execution. In that study, participants pointed to and named a target light in virtual reality (VR), with gesture execution disrupted by shifting or freezing visual feedback of the hand or by changing the target light’s position. These disruptions were applied at early, middle, or late phases of gesture execution, and speech onset was delayed in all three instances. The same study also investigated how speech disruption affected gesture execution by changing the colour of the target light while participants named it. Whether speech was disrupted at the early, middle, or late phase of gesture execution, gesture execution was prolonged in all three instances. These results show that the gesture and speech systems can influence each other even when their execution is disrupted at the late phase, providing strong support for the interactive view.

Despite evidence for interaction during the execution phase, alternative theories suggest that speech and gesture coordination does not rely on ongoing interaction during execution. One such theory is the self-entrainment model, which explains coordination as the result of rhythmic coupling between speech and gesture, with synchronization emerging from synchronized oscillatory processes rather than from continuous online adjustment (Rusiewicz, 2011). This view aligns closely with the ballistic view, assuming that coordination is established before execution and remains stable afterward. However, this theory has been challenged by findings showing that the temporal alignment between speech and gesture does not remain fixed when speech is disrupted, and ongoing bidirectional influence between the two modalities can be observed (Rusiewicz et al., 2014). Consistent with this, Pouw and Dixon (2019) reported that speech and gesture became more synchronized when speech was disrupted by delayed auditory feedback, with both beat and iconic gestures adjusting to changes in speech timing during narration. Moreover, work on individuals with reduced proprioceptive feedback shows that maintaining gesture-speech synchrony critically depends on visual control and biomechanical constraints (Pouw et al., 2022). Together, these findings suggest that stable rhythmic coupling alone cannot fully explain speech and gesture synchronization and indicate a dynamic interaction that continues even after gesture or speech has begun.

However, in Pouw and Dixon (2019) and Pouw et al. (2022), participants produced speech and gestures spontaneously while narrating a cartoon, which limited experimental control over the exact timing of perturbations. Although delayed auditory feedback was used to disrupt speech, the perturbation could not be selectively applied to the gesture execution phase. Consequently, it is unclear whether the observed changes in gesture resulted from interactions during gesture execution itself or from adjustments made earlier during gesture planning. Some of these limitations were addressed in Chu and Hagoort (2014), where gesture or speech was perturbed at specific moments. In that study, however, participants typically began speaking only about 100 ms before completing their pointing gesture. As a result, in most gesture perturbation trials, gesture perturbation mainly affected speech planning rather than speech execution. Therefore, it remains unclear whether disrupting gestures can influence speech execution. Additionally, it is unclear whether findings based on pointing gestures in Chu and Hagoort (2014) can be generalized to other types of gestures, especially iconic gestures, which are more common in everyday communication.

The present study aimed to address these two critical issues by testing whether disrupting gesture or speech production can influence the execution of the other system when individuals depicted shapes and motion with speech and iconic gestures. To achieve this, the study used a VR setup combined with motion tracking techniques to selectively disrupt either gesture or speech during the execution phase of the other system, and then measured the resulting effects on the execution time of the other system. In the VR environment, participants were shown shapes and motion events and instructed to both gesturally depict and verbally describe them. A motion sensor was attached to the tip of the participant’s right index finger (see Figure 1a). Hand movements were tracked in real time and displayed in the virtual environment as a white ball visible to the participant (see Figure 1b).

Figure 1.

(a) Experimental setup with a motion tracking marker on the participant’s right index finger and a centrally located start button. (b) Corresponding virtual reality display showing the start button and the tracked fingertip.

Four experiments were conducted. Experiments 1 and 2 examined whether perturbing gesture influences speech timing. Gestures were disrupted either by delaying visual feedback (Experiment 1) or by enlarging the target shape (Experiment 2), and their effects on speech onset (S-onset) and speech execution (S-exec) times were measured. Experiments 3 and 4 investigated whether perturbing speech affects gesture execution. Speech was disrupted either by delaying auditory feedback (Experiment 3) or by changing the target shape’s colour (Experiment 4), and the impacts on gesture onset (G-onset) and gesture execution (G-exec) times were measured. According to the interactive view, perturbing one modality after execution of the other modality has begun should prolong the execution time of the other modality, whereas the ballistic view predicts no such effect once execution has started.

Experiment 1

In Experiment 1, participants gesturally depicted geometric shapes and motion events and/or named them. Gestures were perturbed by delaying visual feedback of gesture in VR, and the effects on S-onset time and S-exec time were measured. According to the interactive view, disrupting gesture should delay speech onset and prolong speech execution. Moreover, greater prolongation of gesture execution should be associated with larger delays in speech onset and greater prolongation of speech execution. In contrast, the ballistic view predicts that disrupting gesture should have no effect on either S-onset time or S-exec time.

Method

Participants

Nineteen native Dutch speakers participated. All were right-handed and had normal or corrected-to-normal vision. Participants received monetary compensation. Two participants were excluded due to issues with audio recording and motion tracking. The final sample consisted of 17 participants (10 females, 7 males; mean age = 21 years, SD = 3.02). The sample size is comparable to that used in Experiments 1 to 4 (Ns = 16–17) of Chu and Hagoort (2014), which investigated the effects of perturbing pointing gestures on speech timing using a similar within-subject design. Ethical approval was granted by the Ethics Board of the Faculty of Social Sciences, Radboud University.

Apparatus

Participants sat at a table with their upper bodies approximately 10 cm from the table edge. A start button was located along the centerline of the table, 40 cm from the participant’s upper body. Visual stimuli were displayed using an NVIS nVisor SX stereo head-mounted display. A marker was attached to the tip of the participant’s right index finger, and finger movement was tracked with ARTtrack3 infrared tracking cameras. Detailed specifications of the head-mounted display and tracking system are reported in Chu and Hagoort (2014). Speech responses were recorded with a wireless Sennheiser microphone attached to the head-mounted display. The same apparatus was used for all four experiments.

Design and Procedure

Each trial started when the participant pressed the start button. After a random interval drawn from a normal distribution with a mean of 1,000 ms and a standard deviation of 150 ms, a stimulus was presented in VR. The stimulus was either a geometric shape (i.e., circle, triangle, or square) or a motion event (i.e., a ball sliding, bouncing, or rolling across a table).
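The jittered pre-stimulus interval described above can be sketched as follows. This is a minimal illustration rather than the experiment code: the function name and the use of Python's `random` module are assumptions, and because the text does not say whether extreme draws were truncated, no truncation is applied here.

```python
import random

MEAN_MS = 1000.0  # mean of the pre-stimulus interval (from the text)
SD_MS = 150.0     # standard deviation of the interval (from the text)


def sample_prestimulus_interval(rng: random.Random) -> float:
    """Draw one pre-stimulus interval (ms) from a normal distribution."""
    return rng.gauss(MEAN_MS, SD_MS)


rng = random.Random(1)  # fixed seed only so that the sketch is reproducible
intervals = [sample_prestimulus_interval(rng) for _ in range(10_000)]
mean_interval = sum(intervals) / len(intervals)
```

Over many draws the average interval settles near 1,000 ms, while individual trials vary enough that stimulus onset is hard to anticipate.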

In the gesture-and-speech condition, participants were instructed to name and gesturally depict the shape or motion event by drawing it in the air with their right index finger. Speech responses consisted of single-word labels, such as cirkel (“circle”) or rollen (“roll”). Participants were told to respond as naturally as possible, as in everyday communication, and to respond both accurately and quickly. They were not informed about the delay in visual feedback or about any requirement to synchronize gesture and speech. In the gesture-only condition, participants were instructed to depict the shape or motion event solely through gestures without speaking. In the speech-only condition, participants were instructed to name the shape or motion event without producing any gestures. Example videos illustrating the stimuli and participant responses are provided in the Supplemental Material.

The experiment consisted of nine blocks, with three blocks for each of the gesture-and-speech, gesture-only, and speech-only conditions. There were 48 trials in each gesture-and-speech block, 48 trials in each gesture-only block, and 24 trials in each speech-only block. Within each block, trials were divided into two sub-blocks: one with shape depiction trials and the other with motion event depiction trials, with each sub-block comprising half of the trials. The order of the three conditions and the two sub-blocks within each block was counterbalanced across participants.

In the gesture-and-speech and gesture-only conditions, visual feedback of finger movement in VR was delayed by either 117 or 317 ms after the actual finger movement. Each delay occurred equally often and was randomly distributed within each sub-block. Due to system processing constraints, there was a fixed 117 ms delay between the participant’s actual hand movement and the displayed movement of the white ball: approximately 50 ms for motion tracking and 67 ms for visual rendering. Because of this 117 ms minimum system delay, a no-delay condition could not be implemented. In the speech-only condition, no visual feedback delay was applied, as no gestures were produced.
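One way to picture the delay manipulation is a buffer that holds tracked fingertip samples and releases them only after an added software delay. The sketch below is hypothetical: the class and method names are invented for illustration, and it assumes that the 317 ms condition was produced by adding 200 ms on top of the fixed 117 ms system latency, which the text implies but does not state.

```python
from collections import deque
from typing import Deque, Optional, Tuple

TRACKING_LATENCY_MS = 50.0   # approximate motion-tracking latency (from the text)
RENDERING_LATENCY_MS = 67.0  # approximate rendering latency (from the text)
SYSTEM_DELAY_MS = TRACKING_LATENCY_MS + RENDERING_LATENCY_MS  # 117 ms minimum

Position = Tuple[float, float, float]


class DelayedFeedback:
    """Hold tracked samples and release each one only after an added delay."""

    def __init__(self, target_delay_ms: float) -> None:
        # Only the delay beyond the fixed system latency is added in software.
        self.extra_delay_ms = max(0.0, target_delay_ms - SYSTEM_DELAY_MS)
        self._buffer: Deque[Tuple[float, Position]] = deque()

    def push(self, t_ms: float, position: Position) -> None:
        """Store a tracked sample together with its capture time."""
        self._buffer.append((t_ms, position))

    def position_to_render(self, now_ms: float) -> Optional[Position]:
        """Return the newest sample that is old enough to show.

        Returns None when no new sample is due; the caller would then keep
        displaying the previously rendered position.
        """
        latest: Optional[Position] = None
        while self._buffer and now_ms - self._buffer[0][0] >= self.extra_delay_ms:
            latest = self._buffer.popleft()[1]
        return latest


feedback = DelayedFeedback(target_delay_ms=317.0)
for t_ms, x in [(0.0, 0.0), (100.0, 1.0), (200.0, 2.0)]:
    feedback.push(t_ms, (x, 0.0, 0.0))
```

With a 117 ms target the added delay is zero, so every sample is released as soon as it is requested; with a 317 ms target each sample is withheld for a further 200 ms.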

Six practice trials were presented before the first block of each condition. In all practice trials of the gesture-and-speech and gesture-only conditions, the visual feedback delay was set at 117 ms.

Gesture and Speech Timing Measures

Four measures of gesture and speech timing were recorded across all four experiments. (1) G-onset time: the interval between stimulus presentation and the start of the gesture stroke (i.e., a meaningful movement that depicted the geometric shape or motion event). Therefore, this measure includes the time required for gesture planning and preparatory movements (e.g., lifting or positioning the hand before the stroke). (2) G-exec time: the duration from the onset to the end of the gesture stroke, corresponding to the period during which the gesture depicted the shape or motion event. (3) S-onset time: the interval between stimulus presentation and the start of the spoken response. (4) S-exec time: the duration from the onset to the end of the spoken response. The onset and the end of gesture strokes and spoken responses were manually identified using ELAN (Wittenburg et al., 2006), a software package for frame-by-frame annotation of video and audio recordings.
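The four measures defined above reduce to simple differences between annotated event times. The following sketch shows the arithmetic. The `TrialAnnotation` layout and the example values are hypothetical and do not reflect the ELAN export format or any data from the study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TrialAnnotation:
    """Annotated event times for one trial, in ms on a shared clock (hypothetical layout)."""
    stimulus_onset: float
    stroke_onset: float
    stroke_end: float
    speech_onset: float
    speech_end: float


def timing_measures(trial: TrialAnnotation) -> Dict[str, float]:
    """Compute the four timing measures as defined in the text."""
    return {
        "G-onset": trial.stroke_onset - trial.stimulus_onset,  # planning + preparation
        "G-exec": trial.stroke_end - trial.stroke_onset,       # duration of the stroke
        "S-onset": trial.speech_onset - trial.stimulus_onset,  # latency of the spoken response
        "S-exec": trial.speech_end - trial.speech_onset,       # duration of the spoken response
    }


# Made-up annotation times, for illustration only.
example = TrialAnnotation(
    stimulus_onset=1000.0,
    stroke_onset=1900.0,
    stroke_end=3200.0,
    speech_onset=2150.0,
    speech_end=2900.0,
)
measures = timing_measures(example)
```

For this made-up trial the function yields a G-onset time of 900 ms, a G-exec time of 1,300 ms, an S-onset time of 1,150 ms, and an S-exec time of 750 ms.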

Results and Discussion

Error trials were excluded from each condition: 126 trials (5.15% of the trials in that condition) in the gesture-and-speech condition, 113 trials (4.62%) in the gesture-only condition, and 63 trials (5.15%) in the speech-only condition. Trials were excluded if participants initiated gestures or speech before stimulus presentation, gestured or spoke when not required, failed to gesture or speak when required, produced an incorrect gesture or verbal response, or if video or audio recording errors occurred. Table 1 presents the descriptive statistics for G-onset time, G-exec time, S-onset time, and S-exec time across all conditions, including the speech-only condition, which is reported for completeness but was not included in the inferential analyses of Experiment 1.
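The reported exclusion percentages can be checked against the design. The sketch below assumes that each percentage is taken over the experimental trials of the 17 retained participants within the corresponding condition, with practice trials not counted. The variable names are illustrative.

```python
N_PARTICIPANTS = 17        # final sample in Experiment 1
BLOCKS_PER_CONDITION = 3   # three blocks per condition

TRIALS_PER_BLOCK = {"gesture-and-speech": 48, "gesture-only": 48, "speech-only": 24}
ERROR_TRIALS = {"gesture-and-speech": 126, "gesture-only": 113, "speech-only": 63}


def error_rate_percent(condition: str) -> float:
    """Percentage of excluded trials within one condition."""
    total = N_PARTICIPANTS * BLOCKS_PER_CONDITION * TRIALS_PER_BLOCK[condition]
    return 100.0 * ERROR_TRIALS[condition] / total


rates = {condition: round(error_rate_percent(condition), 2) for condition in ERROR_TRIALS}
```

Under these assumptions the denominators are 2,448 trials for the two gesture conditions and 1,224 trials for the speech-only condition, which reproduces the reported 5.15%, 4.62%, and 5.15%.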

Table 1.

Means and Standard Deviations of G-Onset Time, G-Exec Time, S-Onset Time, and S-Exec Time (in Milliseconds) Across Conditions in Experiment 1.

Modality	Visual feedback delay (ms)	G-onset time M	G-onset time SD	G-exec time M	G-exec time SD	S-onset time M	S-onset time SD	S-exec time M	S-exec time SD
Gesture-and-speech	117	882.24	323.96	1288.82	264.92	1154.95	331.31	733.72	132.40
Gesture-and-speech	317	895.84	328.86	1484.10	316.93	1198.14	357.81	764.12	140.29
Gesture-only	117	917.93	349.54	1436.67	355.73	–	–	–	–
Gesture-only	317	934.37	377.91	1683.93	383.11	–	–	–	–
Speech-only	No visual feedback	–	–	–	–	950.31	159.06	705.93	102.42

Note. SD = standard deviation.

Effect of Visual Feedback Delay on G-Onset Time

In Experiment 1, visual feedback of the gesture was delayed from the moment the hand started moving, including the period of gesture preparation before the gesture stroke. However, because visual feedback only becomes informative after the gesture stroke has begun, G-onset time was not expected to be affected by the visual feedback delay.

G-onset times were submitted to a 2 × 2 analysis of variance (ANOVA) with modality (gesture-and-speech, gesture-only) and visual feedback delay (117, 317 ms) as independent variables. There was no main effect of modality (F[1, 16] = 0.26, p = .617, η_p² = .02), and no main effect of visual feedback delay (F[1, 16] = 2.85, p = .111, η_p² = .15). The interaction between modality and visual feedback delay was also not significant (F[1, 16] = 0.03, p = .856, η_p² = .002). These results confirm that G-onset time was unaffected by visual feedback delay in either modality. The lack of an effect on G-onset time indicates that the visual feedback delay did not disrupt gesture planning or preparation, allowing effects on speech timing to be attributed specifically to the perturbation of gesture execution.
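For readers who want to verify the effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three F values reported above. The function name is illustrative, and small discrepancies are expected because the published F values are rounded.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values reported for G-onset time in Experiment 1, all with df = (1, 16).
reported_f = {"modality": 0.26, "delay": 2.85, "interaction": 0.03}
recovered = {
    effect: partial_eta_squared(f_value, 1, 16)
    for effect, f_value in reported_f.items()
}
```

The recovered values agree with the reported .02, .15, and .002 to within rounding.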

Effect of Visual Feedback Delay on G-Exec Time

To test the effectiveness of visual feedback delay manipulation on gesture stroke execution, G-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, gesture-only) and visual feedback delay (117, 317 ms) as independent variables. There was a main effect of modality (F[1, 16] = 25.49, p < .001, η_p² = .61), such that G-exec time was longer in the gesture-only condition than in the gesture-and-speech condition. There was a main effect of visual feedback delay (F[1, 16] = 120.28, p < .001, η_p² = .88), such that G-exec time was longer in the 317 ms delay condition than in the 117 ms delay condition. The interaction between modality and visual feedback delay was also significant (F[1, 16] = 7.81, p = .013, η_p² = .33), such that the effect of visual feedback delay on G-exec time was larger in the gesture-only condition (p < .001) than in the gesture-and-speech condition (p < .001). These results demonstrate that delaying visual feedback of the gesture effectively prolonged G-exec time, confirming that the visual feedback delay manipulation successfully disrupted gesture execution.

The main effect of modality and its interaction with visual feedback delay suggest that gesture execution was more prolonged and more susceptible to visual feedback perturbation when gestures were produced in isolation than when they were produced along with speech. One possible explanation is that, in the gesture-and-speech condition, gesture execution was temporally constrained by the simultaneous production of speech, which may have restricted both the total duration of the gesture execution and the extent to which it could be prolonged in response to delayed visual feedback. Conversely, when gestures were produced without speech, participants might have relied more on visual feedback to guide the unfolding movement and allowed greater temporal flexibility in completing the gesture, leading to longer execution times overall and a greater effect of visual feedback delay. This pattern supports the interactive view, suggesting that concurrent speech constrains gesture execution in real time and reduces its vulnerability to perturbations in visual feedback.

Effects of Visual Feedback Delay on S-Onset Time and S-Exec Time

To test whether disrupting gesture execution affected speech production, paired-sample t-tests were conducted to compare S-onset time and S-exec time between the 117 and 317 ms delay trials in the gesture-and-speech condition. Both S-onset time and S-exec time were longer in the 317 ms delay trials than in the 117 ms delay trials (S-onset: t[16] = 3.56, p = .003, Cohen’s d = 0.86; S-exec: t[16] = 8.47, p < .001, Cohen’s d = 2.06). These results indicate that speech onset was delayed and speech execution was prolonged when gesture execution was prolonged, consistent with the interactive view that gesture and speech production can influence each other during the execution phase.
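The effect sizes for the paired comparisons above can be approximated from the reported t values. The sketch assumes that Cohen's d was computed as the mean difference divided by the standard deviation of the differences (often written d_z), which for a paired t-test equals t divided by the square root of the number of pairs. The text does not state which variant of d was used, so this is a plausibility check rather than a reconstruction.

```python
import math


def cohens_dz_from_t(t_value: float, n_pairs: int) -> float:
    """Cohen's d_z for a paired design, derived from the paired t statistic."""
    return t_value / math.sqrt(n_pairs)


N_PAIRS = 17  # participants contributing to the paired t-tests (df = 16)

d_onset = cohens_dz_from_t(3.56, N_PAIRS)  # S-onset time
d_exec = cohens_dz_from_t(8.47, N_PAIRS)   # S-exec time
```

Both values fall within rounding distance of the reported 0.86 and 2.06, which is consistent with the d_z assumption.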

Relationship Between G-Exec Time and Speech Timing

To assess whether changes in speech timing are linked to the level of gesture execution disruption, Pearson correlation analyses were conducted between differences in G-exec time and differences in S-onset and S-exec times across the two delay conditions in the gesture-and-speech condition. G-exec time differences were positively correlated with S-onset time differences (r[15] = .55, p = .021; see Figure 2a) and S-exec time differences (r[15] = .62, p = .008; see Figure 2b), indicating that greater prolongation of gesture execution was associated with larger delays in speech onset and greater prolongation of speech execution.
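The correlation analysis above operates on per-participant difference scores (317 ms minus 117 ms). The sketch below shows that computation on made-up numbers, and also converts the reported r values into the t statistics on which their significance tests rest, using t = r × sqrt(df / (1 − r²)). All data values and function names are hypothetical. Nothing here is taken from the study's raw data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_from_r(r: float, df: int) -> float:
    """t statistic corresponding to a correlation r with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r ** 2))


# Hypothetical per-participant means (ms) for four participants; NOT study data.
g_exec_117 = [1200.0, 1300.0, 1250.0, 1400.0]
g_exec_317 = [1350.0, 1550.0, 1400.0, 1700.0]
s_exec_117 = [700.0, 740.0, 720.0, 760.0]
s_exec_317 = [720.0, 780.0, 735.0, 815.0]

# Difference scores: how much each participant slowed down under the longer delay.
g_diff = [long - short for short, long in zip(g_exec_117, g_exec_317)]
s_diff = [long - short for short, long in zip(s_exec_117, s_exec_317)]
r_toy = pearson_r(g_diff, s_diff)

# t statistics implied by the reported correlations (df = 15).
t_onset = t_from_r(0.55, 15)
t_exec = t_from_r(0.62, 15)
```

Correlating difference scores rather than raw times removes stable between-participant differences in overall speed, so the coefficient reflects how strongly the size of the gesture prolongation tracks the size of the speech change.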

Figure 2.

Scatter plots showing the relationships between G-exec time differences and S-onset time differences (a) and S-exec time differences (b) in the gesture-and-speech condition in Experiment 1.

These results are consistent with the interactive view, which holds that gesture and speech can influence each other during execution. An alternative explanation is that speech timing was independently affected by attentional disruption caused by the unexpected visual feedback delay. This explanation is unlikely for two reasons. First, the visual feedback delay was introduced during the preparatory phase of the gesture, on average 548.61 ms (SD = 273.44 ms) before speech onset, making it unlikely that any transient surprisal persisted into the speech execution phase. Second, prolongation of gesture execution was positively correlated with both delays in speech onset and prolongation of speech execution, indicating that speech timing varied with the degree of gesture disruption. Experiment 2 was designed to more strictly eliminate this alternative explanation.

Furthermore, because gesture perturbation in Experiment 1 occurred before speech execution began, it remained unclear whether speech execution would still be affected if gesture disruption happened after speech onset. Experiment 2 aimed to address this issue.

Experiment 2

In Experiment 2, participants gesturally depicted geometric shapes (i.e., triangle, circle, and square) and/or named them. To test whether disrupting gesture after speech onset affects speech execution, the size of the stimulus shape was enlarged after participants had begun speaking, and participants were instructed to adjust their gesture accordingly. According to the interactive view, gesture perturbation after speech onset should prolong speech execution.

A speech-only condition was included to control for nonspecific effects of unexpected visual input caused by shape enlargement. In this condition, participants named the shape without gestures, and the shape was enlarged after speech onset in the same manner as in the gesture-and-speech condition. Because the enlargement did not change the shape’s identity, speech execution was not expected to be affected in the speech-only condition.

Method

Participants

Nineteen native Dutch speakers participated. All were right-handed and had normal or corrected-to-normal vision. Participants received monetary compensation. One participant was excluded due to failure in motion tracking. The final sample consisted of 18 participants (7 females, 11 males; mean age = 23 years, SD = 3.19). The sample size is comparable to that used in Experiments 1 to 4 (Ns = 16–17) of Chu and Hagoort (2014), which investigated the effects of perturbing pointing gestures on speech timing using a similar within-subject design.

Design and Procedure

Each trial began with the presentation of a geometric shape (i.e., circle, triangle, or square) displayed in one of two colours (i.e., yellow or blue). In the gesture-and-speech condition, participants were instructed to gesturally depict the shape using both hands and to name the shape along with its colour, such as “een gele cirkel” (“a yellow circle”). In this experiment, participants named the shape using three words (e.g., “een gele cirkel”) instead of a single word (e.g., “cirkel,” as in Experiment 1). This was done to ensure that speech execution was still ongoing when gesture was perturbed, which increased the likelihood of detecting effects of gesture perturbation on speech execution. In addition, participants were instructed to use both hands to depict the shapes (e.g., holding both hands opposite each other with a ball shape to represent a circle) rather than using an index-finger drawing gesture. A pilot study showed that when drawing gestures were used, participants typically completed both the gesture and the spoken response before producing a second gesture to depict the enlarged shape in shape-enlarged trials. In such cases, gesture perturbation occurred only after speech execution had finished, making it impossible to assess the effect of gesture perturbation on speech execution. In contrast, when participants used both hands to depict the shapes, they consistently enlarged their gestures while speech was still ongoing, allowing the examination of how gesture perturbation affects speech execution. In addition, no visual feedback of hand movement was displayed in VR in Experiment 2, because the system did not support reliable real-time tracking and rendering of the full hand movements.

In the gesture-only condition, participants were instructed to use both hands to depict the shape without speaking. In the speech-only condition, participants were instructed to name the shape along with its colour without producing any gestures.

In the shape-enlarged trials, the target shape was enlarged to twice its original size. The enlargement occurred over 17 ms. In the gesture-and-speech condition and the speech-only condition, shape enlargement was triggered when speech onset was detected. In the gesture-only condition, participants did not speak, and therefore shape enlargement was triggered at an estimated time based on the average S-onset time from 12 non-shape-enlarged gesture-and-speech trials presented at the beginning of the experiment. Example stimuli and participant responses are provided in the Supplemental Material.
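The trigger time for the gesture-only condition can be sketched as the mean S-onset time over the 12 calibration trials. This is a hypothetical illustration: the function name is invented, the onset values are made up, and it assumes that the estimate was computed separately for each participant and measured from stimulus onset, neither of which is stated in the text.

```python
from statistics import mean
from typing import Sequence

N_CALIBRATION_TRIALS = 12  # non-shape-enlarged gesture-and-speech trials (from the text)


def estimated_enlargement_time(s_onsets_ms: Sequence[float]) -> float:
    """Mean S-onset time (ms) used as the enlargement trigger when no speech occurs."""
    if len(s_onsets_ms) != N_CALIBRATION_TRIALS:
        raise ValueError(f"expected {N_CALIBRATION_TRIALS} calibration trials")
    return mean(s_onsets_ms)


# Made-up S-onset times (ms) for one participant; NOT study data.
calibration_onsets = [
    980.0, 1010.0, 1050.0, 995.0, 1020.0, 1005.0,
    1040.0, 990.0, 1030.0, 1000.0, 1015.0, 1025.0,
]
trigger_ms = estimated_enlargement_time(calibration_onsets)
```

In a gesture-only trial, the shape would then be enlarged once `trigger_ms` has elapsed after stimulus presentation, approximating the moment at which that participant would have started speaking.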

The experiment consisted of nine blocks, with three blocks for each of the gesture-and-speech, gesture-only, and speech-only conditions. The order of these conditions was counterbalanced across participants. Each block included 48 trials, half of which involved shape enlargement and half did not. Shape enlargement was equally frequent across all shape-colour combinations. Six practice trials without shape enlargement were presented before the first block of each condition.

Results and Discussion

Error trials were excluded from each condition: 124 trials (4.78% of the trials in that condition) in the gesture-and-speech condition, 149 trials (5.75%) in the gesture-only condition, and 75 trials (2.89%) in the speech-only condition. Trial exclusion criteria were identical to those in Experiment 1. Table 2 presents the descriptive statistics for G-onset time, G-exec time, S-onset time, and S-exec time in each condition.

Table 2.

Means and Standard Deviations of G-Onset Time, G-Exec Time, S-Onset Time, and S-Exec Time (in Milliseconds) Across Conditions in Experiment 2.

Modality	Shape enlargement	G-onset time M	G-onset time SD	G-exec time M	G-exec time SD	S-onset time M	S-onset time SD	S-exec time M	S-exec time SD
Gesture-and-speech	Non-enlarged	562.49	148.69	1634.78	357.26	1032.44	186.72	1183.69	171.63
Gesture-and-speech	Enlarged	568.57	163.68	1900.84	393.82	1039.98	178.93	1206.22	176.00
Gesture-only	Non-enlarged	422.90	87.83	1513.38	393.04	–	–	–	–
Gesture-only	Enlarged	427.12	83.75	1844.35	453.54	–	–	–	–
Speech-only	Non-enlarged	–	–	–	–	916.08	126.30	1232.60	183.95
Speech-only	Enlarged	–	–	–	–	901.82	118.72	1226.44	182.74

Note. SD = standard deviation.

Effect of Shape Enlargement on G-Onset Time

In Experiment 2, stimulus shapes were enlarged when speech onset was detected in the gesture-and-speech condition, or at an estimated S-onset time in the gesture-only condition. Because gesture onset preceded speech onset by an average of 470.46 ms (SD = 209.96 ms) in the non-shape-enlarged trials in the gesture-and-speech condition, G-onset time was not expected to be affected by the shape enlargement manipulation.

G-onset times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, gesture-only) and shape enlargement (enlarged, non-enlarged) as independent variables. There was a main effect of modality (F[1, 17] = 24.43, p < .001, η_p² = .59), such that G-onset time was shorter in the gesture-only condition than in the gesture-and-speech condition. There was no main effect of shape enlargement (F[1, 17] = 2.00, p = .176, η_p² = .11). The interaction between modality and shape enlargement was also not significant (F[1, 17] = 0.05, p = .821, η_p² = .003). These results confirm that G-onset time was unaffected by shape enlargement in either modality. The absence of an effect on G-onset time indicates that the shape enlargement did not disrupt gesture planning or preparation, allowing effects on speech execution to be attributed specifically to the perturbation of gesture execution.

Effect of Shape Enlargement on G-Exec Time

To test the effectiveness of the shape enlargement manipulation, G-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, gesture-only) and shape enlargement (enlarged, non-enlarged) as independent variables. The main effect of modality was marginal (F[1, 17] = 4.11, p = .059, η_p² = .20), with a trend toward longer G-exec time in the gesture-only trials than in the gesture-and-speech trials. There was a main effect of shape enlargement (F[1, 17] = 47.58, p < .001, η_p² = .74), such that G-exec time was longer in the shape-enlarged condition than in the non-shape-enlarged condition. The interaction between modality and shape enlargement was not significant (F[1, 17] = 2.47, p = .135, η_p² = .13). Therefore, shape enlargement consistently prolonged gesture execution, confirming the effectiveness of the gesture perturbation.

Unlike in Experiment 1, the main effect of modality was only marginal, and there was no significant interaction between modality and shape enlargement. This difference likely reflects the nature of the gesture perturbation in Experiment 2. In Experiment 1, delaying visual feedback of the moving hand directly perturbed the visuomotor control loop that guides ongoing movement. In contrast, in Experiment 2, no visual feedback of hand movement was available in VR, and shape enlargement primarily required adjusting gesture amplitude rather than recalibrating a perturbed visuomotor feedback loop. Consequently, gesture execution was prolonged in both modality conditions, but modulation by modality was weaker and did not yield a significant interaction.

Effects of Shape Enlargement on S-Onset Time

In Experiment 2, the stimulus shape was enlarged starting from the moment the spoken response was initiated, and therefore, S-onset time was not expected to be affected by the shape enlargement manipulation.

S-onset times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and shape enlargement (shape-enlarged, non-shape-enlarged) as independent variables. There was a main effect of modality (F[1, 17] = 13.49, p = .002, η_p² = .44), such that S-onset time was shorter in the speech-only condition than in the gesture-and-speech condition. There was no main effect of shape enlargement (F[1, 17] = 0.19, p = .666, η_p² = .01). The interaction between modality and shape enlargement was also not significant (F[1, 17] = 1.85, p = .192, η_p² = .10). These results confirm that S-onset time was not affected by shape enlargement in either modality. Therefore, any effects of gesture perturbation on speech execution cannot be attributed to differences in speech planning.

Effects of Shape Enlargement on S-Exec Time

To test whether disrupting gesture after speech onset affects speech execution, and whether unexpected shape enlargement affects speech execution when gesture is absent, S-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and shape enlargement (enlarged, non-enlarged) as independent variables. There was a main effect of modality (F[1, 17] = 5.46, p = .032, η_p² = .24), such that S-exec time was longer in the speech-only condition than in the gesture-and-speech condition. There was no main effect of shape enlargement (F[1, 17] = 1.87, p = .189, η_p² = .10). However, there was a significant interaction between modality and shape enlargement (F[1, 17] = 6.08, p = .025, η_p² = .26). Follow-up simple effects analyses showed that S-exec time increased with shape enlargement only in the gesture-and-speech condition (p = .011), but not in the speech-only condition (p = .500). These results rule out the possibility that S-exec time and G-exec time were independently affected by unexpected visual input (i.e., the shape enlargement), indicating that prolonged speech execution resulted from gesture perturbation.

In the gesture-and-speech condition, participants enlarged their gestures on average 626.13 ms (SD = 182.78 ms) after speech onset in the shape-enlarged trials. Given that the mean S-exec time in the non-enlarged trials was 1183.69 ms (SD = 171.63 ms), gesture disruption occurred after more than half of the speech execution interval had already elapsed. Despite this late perturbation, speech execution was still prolonged, indicating that speech remains sensitive to gesture disruption well into execution. This supports the interactive view that gesture and speech can interact during execution.
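The "more than half" statement follows directly from the two means reported above. The short Python check below uses only those two values; the function and variable names are illustrative and not part of the original analysis.

```python
# Means reported in the text for the gesture-and-speech condition of Experiment 2.
enlargement_latency_ms = 626.13  # speech onset to gesture enlargement (shape-enlarged trials)
baseline_s_exec_ms = 1183.69     # mean S-exec time (non-enlarged trials)


def elapsed_fraction(latency_ms, duration_ms):
    """Fraction of a baseline execution interval that had elapsed at the perturbation."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return latency_ms / duration_ms


fraction = elapsed_fraction(enlargement_latency_ms, baseline_s_exec_ms)
print(f"Fraction of speech execution elapsed: {fraction:.2f}")
```

On these means, roughly 53% of the baseline speech execution interval had passed when the gesture was enlarged. This is a comparison of group means, so it describes the average case rather than every trial.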

Relationship Between G-Exec and S-Exec Time Prolongations

To assess whether changes in S-exec time correlated with the degree of gesture execution disruption, Pearson correlation analyses were conducted between differences in G-exec time and differences in S-exec time across the shape-enlarged and non-shape-enlarged trials in the gesture-and-speech condition. G-exec time differences were positively correlated with S-exec time differences (r[16] = .54, p = .020; see Figure 3), indicating that greater prolongation of gesture execution was associated with greater prolongation of speech execution.

Figure 3.

Scatter plot showing the relationship between G-exec time differences and S-exec time differences in the gesture-and-speech condition in Experiment 2.
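The correlation analysis above can be sketched as follows: compute one prolongation score per participant in each modality (perturbed minus unperturbed mean), then correlate the two sets of scores, with degrees of freedom equal to the number of participants minus two. The Python sketch below is a minimal illustration under that reading. The numbers are hypothetical placeholders, not data from the study, and the function names are assumptions.

```python
import math


def prolongation(perturbed_ms, baseline_ms):
    """Per-participant difference score: perturbed mean minus unperturbed mean."""
    return [p - b for p, b in zip(perturbed_ms, baseline_ms)]


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-participant means in ms (NOT data from the study).
g_exec_baseline = [1200.0, 1100.0, 1350.0, 1250.0, 1400.0]
g_exec_perturbed = [1300.0, 1150.0, 1500.0, 1330.0, 1600.0]
s_exec_baseline = [1150.0, 1000.0, 1200.0, 1100.0, 1250.0]
s_exec_perturbed = [1190.0, 1010.0, 1270.0, 1130.0, 1340.0]

g_diff = prolongation(g_exec_perturbed, g_exec_baseline)
s_diff = prolongation(s_exec_perturbed, s_exec_baseline)
r = pearson_r(g_diff, s_diff)
df = len(g_diff) - 2  # degrees of freedom for a Pearson correlation
print(f"r[{df}] = {r:.2f}")
```

Using difference scores removes each participant's baseline speed, so the correlation reflects how the two prolongations covary rather than whether some participants are slower overall.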

Experiment 2 extends Experiment 1 by showing that gesture-speech interaction persists even when gesture disruption occurs after speech execution has started. Furthermore, Experiment 2 showed that disrupting gesture during speech execution led to prolonged speech execution only in the gesture-and-speech condition, but not in the speech-only condition. This pattern rules out explanations based on nonspecific effects of unexpected visual input. Notably, gesture enlargement occurred on average more than 600 ms after speech onset, yet speech execution was still prolonged, and the degree of speech prolongation scaled with the extent of gesture disruption. Along with Experiment 1, these findings strongly support the interactive view, demonstrating ongoing interaction between gesture and speech during execution.

Experiment 3

Whereas Experiments 1 and 2 examined how perturbing gestures affects speech execution, Experiments 3 and 4 tested the complementary prediction of the interactive view that perturbing speech affects gesture execution. In Experiment 3, speech was disrupted by delaying auditory feedback by 100 ms after speech onset while participants named and gesturally depicted shapes and motion events. If gesture and speech interact during execution, disrupting speech should prolong gesture execution, with greater speech disruption leading to greater gesture prolongation.

Method

Participants

Nineteen native Dutch speakers participated in the experiment (10 females, 9 males; mean age = 22 years, SD = 3.81). All were right-handed and had normal or corrected-to-normal vision. Participants received monetary compensation. The sample size is comparable to that used in Experiment 5 (N = 18) of Chu and Hagoort (2014), which investigated the effects of perturbing speech on pointing gesture timing using a similar within-subject design.

Apparatus

Participants’ speech was delayed using a Sonifex Redbox RB DS2 stereo audio delay synchroniser. The delayed speech was played back to participants through Sennheiser CX 215 in-ear earphones at 70 dB sound pressure level.

Design and Procedure

The tasks in the gesture-and-speech, gesture-only, and speech-only conditions were identical to those used in Experiment 1. Participants were not informed about the auditory feedback delay or about any requirement to synchronize gesture and speech. Example videos illustrating the stimuli and participant responses are provided in the Supplemental Material.

The experiment consisted of nine blocks, with three blocks for each of the gesture-and-speech, gesture-only, and speech-only conditions. There were 48 trials in each gesture-and-speech block, 48 trials in each speech-only block, and 24 trials in each gesture-only block. Within each block, trials were divided into two sub-blocks, one containing shape depiction trials and the other containing motion event depiction trials, with each sub-block comprising half of the trials. The order of the three conditions and the order of the two sub-blocks within each block were counterbalanced across participants.

In the gesture-and-speech and speech-only conditions, participants’ own speech was played back through earphones either without delay or with a 100 ms delay. The delayed and non-delayed trials were equally frequent and were randomly distributed within each sub-block. In the gesture-only condition, no auditory feedback delay was applied, as no speech was produced.

Six practice trials were presented before the first block of each condition. In all practice trials of the gesture-and-speech and the speech-only conditions, auditory feedback was not delayed.

Results and Discussion

A total of 60 error trials (2.19% of all trials) in the gesture-and-speech condition, 13 error trials (0.95% of all trials) in the gesture-only condition, and 26 error trials (0.95% of all trials) in the speech-only condition were excluded. Trial exclusion criteria were identical to those in Experiment 1. Table 3 presents the descriptive statistics for G-onset time, G-exec time, S-onset time, and S-exec time across all conditions, including the gesture-only condition, which is reported for completeness but was not included in the inferential analyses of Experiment 3.
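The exclusion percentages can be checked against the design described above. The Python sketch below assumes that "all trials" means all trials of the given condition pooled across the 19 participants (3 blocks per condition, 48 or 24 trials per block). That reading is an interpretation, not something the text states explicitly.

```python
participants = 19
blocks_per_condition = 3
trials_per_block = {"gesture-and-speech": 48, "gesture-only": 24, "speech-only": 48}
excluded_trials = {"gesture-and-speech": 60, "gesture-only": 13, "speech-only": 26}


def exclusion_rate(n_excluded, n_total):
    """Percentage of trials excluded."""
    return 100.0 * n_excluded / n_total


rates = {}
for condition, per_block in trials_per_block.items():
    total = participants * blocks_per_condition * per_block
    rates[condition] = round(exclusion_rate(excluded_trials[condition], total), 2)
    print(f"{condition}: {excluded_trials[condition]}/{total} = {rates[condition]}%")
```

Under this assumption the computed rates match the reported 2.19%, 0.95%, and 0.95%. The gesture-only and speech-only conditions share the same percentage because both the excluded count and the trial total are exactly halved in the gesture-only condition.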

Table 3.

Means and Standard Deviations of G-Onset Time, G-Exec Time, S-Onset Time, and S-Exec Time (in Milliseconds) Across Conditions in Experiment 3.

Modality	Auditory feedback delay	G-onset time		G-exec time		S-onset time		S-exec time
Modality	Auditory feedback delay	M	SD	M	SD	M	SD	M	SD
Gesture-and-speech	No delay	922.96	258.46	1285.25	331.45	976.02	264.87	752.86	159.33
Gesture-and-speech	100 ms delay	906.35	269.27	1309.64	338.55	967.65	277.41	844.63	185.58
Gesture-only	No auditory feedback	927.43	293.77	1419.94	349.17
Speech-only	No delay					877.25	239.44	678.53	90.40
Speech-only	100 ms delay					884.53	231.46	749.63	112.99

Note. SD = standard deviation.

Effect of Auditory Feedback Delay on S-Onset Time

In Experiment 3, auditory feedback delay occurred 100 ms after speech onset detection, so it should not affect S-onset time. S-onset times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and auditory feedback delay (delayed, non-delayed) as independent variables. There was a main effect of modality (F[1, 18] = 6.98, p = .017, η_p² = .28), such that S-onset time was longer in the gesture-and-speech condition than in the speech-only condition. There was no main effect of auditory feedback delay (F[1, 18] = 0.01, p = .943, η_p² < .001). The interaction between modality and auditory feedback delay was also not significant (F[1, 18] = 1.27, p = .275, η_p² = .07). These results confirm that auditory feedback delay did not affect S-onset time in either modality, ensuring that any subsequent effects on gesture execution can be attributed specifically to the perturbation of speech execution.
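The text does not state how η_p² was obtained. For a single-effect F test, however, it can be recovered from the F value and its degrees of freedom through the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies that identity to the three effects reported above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values for S-onset time in Experiment 3, each with df = (1, 18).
reported_f = {"modality": 6.98, "auditory feedback delay": 0.01, "interaction": 1.27}

eta = {name: partial_eta_squared(f, 1, 18) for name, f in reported_f.items()}
for name, value in eta.items():
    print(f"{name}: eta_p^2 = {value:.3f}")
```

The recovered values (about .28, below .001, and about .07) agree with those reported. Because the published F values are themselves rounded, a recomputed effect size can differ from a reported one in the last decimal place.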

Effect of Auditory Feedback Delay on S-Exec Time

To test the effectiveness of auditory feedback delay manipulation, S-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and auditory feedback delay (delayed, non-delayed) as independent variables. There was a main effect of modality (F[1, 18] = 16.18, p < .001, η_p² = .47), such that S-exec time was longer in the gesture-and-speech condition than in the speech-only condition. There was a main effect of auditory feedback delay (F[1, 18] = 114.87, p < .001, η_p² = .87), such that S-exec time was longer in the delay condition than in the no-delay condition. The interaction between modality and auditory feedback delay was also significant (F[1, 18] = 19.64, p < .001, η_p² = .52), such that the effect of auditory feedback delay on S-exec time was larger in the gesture-and-speech condition (p < .001) than in the speech-only condition (p < .001). These results demonstrate that delaying auditory feedback of speech effectively prolonged S-exec time, confirming that the auditory feedback delay manipulation successfully disrupted speech execution.

It is noteworthy that when visual feedback of gesture was delayed in Experiment 1, gesture execution was longer and more strongly affected in the gesture-only than in the gesture-and-speech condition. In contrast, when auditory feedback was delayed in Experiment 3, speech execution was longer and more strongly affected in the gesture-and-speech than in the speech-only condition. This pattern may arise because, when gesture execution is disrupted, concurrent speech provides a stable temporal framework that constrains gesture timing and limits the extent to which gesture execution can be prolonged. Conversely, when speech execution is disrupted, ongoing gesture production may impose additional demands on shared control processes, thereby amplifying the impact of auditory perturbation on speech execution. Regardless of the precise mechanisms involved, these results provide converging evidence, consistent with the interactive view, that gesture and speech mutually constrain each other online during execution.

Effects of Auditory Feedback Delay on G-Onset Time

In the gesture-and-speech condition, participants on average initiated their spoken response 56.05 ms (SD = 35.94 ms) after gesture stroke onset in the no-delay trials, so G-onset time was not expected to be affected by the auditory feedback delay. A paired-samples t-test showed that G-onset time did not differ between the delayed and non-delayed conditions (t[18] = 1.64, p = .119, Cohen’s d = 0.38). This absence of an effect indicates that auditory feedback delay did not influence gesture planning or preparatory processes, allowing any subsequent effects on gesture execution to be attributed specifically to the perturbation of speech execution.

Effects of Auditory Feedback Delay on G-Exec Time

To test whether disrupting speech execution affected G-exec time, a paired-samples t-test was conducted to compare G-exec time between the delayed and no-delay trials in the gesture-and-speech condition. G-exec time was longer when auditory feedback was delayed than when it was not (t[18] = 4.78, p < .001, Cohen’s d = 1.10). This result indicates that gesture execution was prolonged when speech execution was disrupted, consistent with the interactive view that gesture and speech production can influence each other during the execution phase.
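The text does not specify which variant of Cohen's d was used. One common convention for paired designs, d_z = t / √N, reproduces both effect sizes reported for G-onset and G-exec time with N = 19. The Python sketch below is a consistency check under that assumption, not a statement of the authors' method.

```python
import math


def cohens_dz_from_t(t_value, n_pairs):
    """Cohen's d_z for a paired-samples t-test: the t statistic divided by sqrt(N)."""
    if n_pairs < 2:
        raise ValueError("need at least two pairs")
    return t_value / math.sqrt(n_pairs)


n_participants = 19  # t tests above are reported with df = 18

d_onset = cohens_dz_from_t(1.64, n_participants)  # G-onset time
d_exec = cohens_dz_from_t(4.78, n_participants)   # G-exec time
print(f"G-onset: d = {d_onset:.2f}; G-exec: d = {d_exec:.2f}")
```

Both values round to the reported 0.38 and 1.10. Note that d_z standardizes by the variability of the difference scores, so it is not directly comparable to a d computed from the condition standard deviations in Table 3.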

Relationship Between S-Exec and G-Exec Time Prolongations

To assess whether prolongation of G-exec time was related to prolongation of S-exec time, a Pearson correlation analysis was conducted between S-exec time differences and G-exec time differences between the delayed and no-delay trials in the gesture-and-speech condition. G-exec time differences were positively correlated with S-exec time differences (r[17] = .61, p = .006; see Figure 4), indicating that greater prolongation of speech execution was associated with greater prolongation of gesture execution.

Figure 4.

Scatter plot showing the relationship between G-exec time differences and S-exec time differences in the gesture-and-speech condition in Experiment 3.

Experiment 3 demonstrates that perturbing speech execution consistently prolonged gesture execution, with gesture prolongation correlated with speech prolongation, indicating a systematic connection between the two modalities during execution. While consistent with the interactive view, the observed prolongation of gesture execution could potentially result from a nonspecific influence of auditory feedback delay on attention or motor control. Experiment 4 was therefore designed to address this alternative explanation.

Experiment 4

In Experiment 4, participants gesturally depicted geometric shapes (i.e., triangle, circle, or square) and/or named the shape along with its colour (i.e., blue or yellow). Speech execution was perturbed by changing the colour of the target shape after speech onset, requiring participants to update their spoken response. According to the interactive view, disrupting speech after gesture initiation should prolong gesture execution, and greater increases in S-exec time should be associated with greater increases in G-exec time.

A gesture-only condition was included to rule out effects of unexpected visual changes on gesture execution. Because colour was irrelevant to the gestural depiction of shape, gesture execution was not expected to be affected by colour change in the absence of speech.

Method

Participants

Eighteen native Dutch speakers participated. All were right-handed with normal or corrected-to-normal vision. Participants received monetary compensation. One participant was excluded due to a failure in audio recording. The final sample consisted of 17 participants (11 females, 6 males; mean age = 22 years, SD = 3.35). The sample size is comparable to that used in Experiment 5 (N = 18) of Chu and Hagoort (2014), which investigated the effects of perturbing speech on pointing gesture timing using a similar within-subject design.

Design and Procedure

In the gesture-and-speech condition and the speech-only condition, the colour of the target shape was changed in the colour-changed trials once speech onset was detected. Participants were instructed to update their spoken response to reflect the new colour of the shape. In the gesture-only condition, because no speech was produced, the colour change started at an estimated time based on each participant’s average S-onset time, calculated from 12 non-colour-changed gesture-and-speech trials presented at the beginning of the experiment.

The experiment consisted of nine blocks, with three blocks for each of the gesture-and-speech, gesture-only, and speech-only conditions. The order of these conditions was counterbalanced across participants. Each block contained 48 trials, with half of the trials involving a colour change and half involving no colour change. Colour changes occurred over 17 ms and were randomly distributed across shape and colour combinations. Six practice trials without colour change were presented before the first block of each condition.
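The block structure and the trigger estimate for the gesture-only condition can be sketched as follows. This Python sketch assumes that each 48-trial block fully crosses shape, colour, and colour change (12 cells, 4 repetitions each); the text states only that half of the trials involved a colour change and that changes were randomly distributed across shape and colour combinations. All names and the calibration values are hypothetical.

```python
import itertools
import random
import statistics

SHAPES = ("triangle", "circle", "square")
COLOURS = ("blue", "yellow")


def estimated_trigger_ms(calibration_s_onsets_ms):
    """Colour-change time for gesture-only trials: mean S-onset over calibration trials."""
    return statistics.mean(calibration_s_onsets_ms)


def build_block(n_trials=48, seed=None):
    """Return a shuffled trial list with equal numbers of change and no-change trials."""
    cells = list(itertools.product(SHAPES, COLOURS, (False, True)))  # 12 design cells
    repetitions, remainder = divmod(n_trials, len(cells))
    if remainder:
        raise ValueError("n_trials must be a multiple of the number of design cells")
    trials = [
        {"shape": shape, "colour": colour, "colour_change": change}
        for shape, colour, change in cells
        for _ in range(repetitions)
    ]
    random.Random(seed).shuffle(trials)  # random order within the block
    return trials


block = build_block(seed=1)
n_changed = sum(trial["colour_change"] for trial in block)

# Hypothetical S-onset times (ms) from 12 non-colour-changed calibration trials.
calibration = [850.0, 900.0, 870.0, 880.0, 910.0, 860.0,
               890.0, 875.0, 865.0, 905.0, 895.0, 880.0]
trigger_ms = estimated_trigger_ms(calibration)
print(len(block), n_changed, trigger_ms)
```

Estimating the trigger time per participant keeps the colour change in the gesture-only condition at roughly the same point in the trial as in the conditions where it is locked to detected speech onset.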

Results and Discussion

A total of 167 error trials (6.82% of all trials) in the gesture-and-speech condition, 42 error trials (1.72% of all trials) in the gesture-only condition, and 214 error trials (8.74% of all trials) in the speech-only condition were excluded. Trial exclusion criteria were identical to those in Experiment 1. Table 4 presents the descriptive statistics for G-onset time, G-exec time, S-onset time, and S-exec time in each condition.

Table 4.

Means and Standard Deviations of G-Onset Time, G-Exec Time, S-Onset Time, and S-Exec Time (in Milliseconds) Across Conditions in Experiment 4.

Modality	Colour change	G-onset time		G-exec time		S-onset time		S-exec time
Modality	Colour change	M	SD	M	SD	M	SD	M	SD
Gesture-and-speech	Non-changed	754.37	133.96	1197.49	361.34	870.17	127.46	887.61	110.95
Gesture-and-speech	Changed	750.38	143.89	1282.14	335.26	870.27	130.24	1633.12	180.63
Gesture-only	Non-changed	688.63	147.19	1274.67	412.14
Gesture-only	Changed	689.47	149.42	1286.02	404.79
Speech-only	Non-changed					772.34	122.54	881.13	97.36
Speech-only	Changed					768.85	115.76	1576.53	160.11

Note. SD = standard deviation.

Effect of Colour Change on S-Onset Time

In Experiment 4, the colour change of the stimulus shape occurred after speech onset detection, so it should not affect S-onset time. S-onset times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and colour change (non-changed, changed) as independent variables. There was a main effect of modality (F[1, 16] = 56.86, p < .001, η_p² = .78), such that S-onset time was longer in the gesture-and-speech condition than in the speech-only condition. There was no main effect of colour change (F[1, 16] = 0.10, p = .753, η_p² = .01). The interaction between modality and colour change was also not significant (F[1, 16] = 0.27, p = .610, η_p² = .02). These results confirm that S-onset time was unaffected by colour change in either modality, ensuring that any subsequent effects on gesture execution can be attributed specifically to the perturbation of speech execution.

Effect of Colour Change on S-Exec Time

To test the effectiveness of the colour change manipulation, S-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, speech-only) and colour change (non-changed, changed) as independent variables. There was a main effect of modality (F[1, 16] = 6.23, p = .024, η_p² = .28), such that S-exec time was longer in the gesture-and-speech condition than in the speech-only condition. There was a main effect of colour change (F[1, 16] = 902.00, p < .001, η_p² = .98), such that S-exec time was longer in the colour-changed condition than in the non-colour-changed condition. The interaction between modality and colour change was also significant (F[1, 16] = 13.08, p = .002, η_p² = .45), such that the effect of colour change on S-exec time was larger in the gesture-and-speech condition (p < .001) than in the speech-only condition (p < .001). These results demonstrate that changing the colour of the stimulus shapes effectively prolonged S-exec time, confirming that the colour change manipulation successfully disrupted speech execution.

These results replicate the pattern observed in Experiment 3: speech execution was longer and more strongly affected by speech perturbation in the gesture-and-speech condition than in the speech-only condition, even though the two experiments used different perturbation mechanisms. These converging findings strengthen the conclusion that gesture and speech mutually constrain each other during execution.
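The modality by colour change interaction on S-exec time can be read off Table 4 as a difference of differences. The Python sketch below uses only the cell means from that table. The inferential test was run on participant-level data, so this calculation illustrates the direction and approximate size of the effect, not its significance.

```python
# Cell means for S-exec time (ms), copied from Table 4.
s_exec_means = {
    ("gesture-and-speech", "non-changed"): 887.61,
    ("gesture-and-speech", "changed"): 1633.12,
    ("speech-only", "non-changed"): 881.13,
    ("speech-only", "changed"): 1576.53,
}


def perturbation_effect(means, modality):
    """Prolongation caused by the colour change within one modality condition."""
    return means[(modality, "changed")] - means[(modality, "non-changed")]


effect_bimodal = perturbation_effect(s_exec_means, "gesture-and-speech")
effect_unimodal = perturbation_effect(s_exec_means, "speech-only")
interaction_contrast = effect_bimodal - effect_unimodal
print(f"{effect_bimodal:.2f} ms vs {effect_unimodal:.2f} ms; "
      f"difference of differences = {interaction_contrast:.2f} ms")
```

On the cell means, the colour change prolonged speech by about 746 ms when a gesture accompanied it and by about 695 ms when speech was produced alone, a difference of roughly 50 ms.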

Effect of Colour Change on G-Onset Time

In Experiment 4, the colour of stimulus shapes was changed upon speech onset detection. Since gesture onset preceded speech onset by an average of 115.85 ms (SD = 116.53 ms) in non-colour-changed trials in the gesture-and-speech condition, G-onset time was not expected to be affected by stimulus colour change. G-onset times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, gesture-only) and colour change (non-changed, changed) as independent variables. There was a main effect of modality (F[1, 16] = 12.62, p = .003, η_p² = .44), such that G-onset time was longer in the gesture-and-speech condition than in the gesture-only condition. There was no main effect of colour change (F[1, 16] = 0.10, p = .754, η_p² = .01). The interaction between modality and colour change was also not significant (F[1, 16] = 0.29, p = .597, η_p² = .02). These results confirm that colour change did not influence gesture planning or preparatory processes in either modality, allowing any subsequent effects on gesture execution to be attributed specifically to the perturbation of speech execution.

Effect of Colour Change on G-Exec Time

To test whether disrupting speech affects gesture execution, and whether unexpected colour change affects gesture execution in the absence of speech, G-exec times were submitted to a 2 × 2 ANOVA with modality (gesture-and-speech, gesture-only) and colour change (non-changed, changed) as the independent variables. It is worth noting that in approximately 27% of colour-changed trials (N = 310) in the gesture-and-speech condition, participants corrected their speech and named the new colour after gesture execution was completed. These trials were excluded from the analyses because speech disruption could no longer affect gesture execution.

There was no main effect of modality (F[1, 16] = 1.57, p = .228, η_p² = .09), and no main effect of colour change (F[1, 16] = 4.06, p = .061, η_p² = .20). However, there was a significant interaction between modality and colour change (F[1, 16] = 31.51, p < .001, η_p² = .66). Follow-up simple effects analyses showed that G-exec time was longer in the colour-changed trials than in the non-colour-changed trials in the gesture-and-speech condition (p = .004), but not in the gesture-only condition (p = .643). These results rule out the possibility that G-exec time and S-exec time were independently affected by unexpected visual input (i.e., the colour change), indicating that prolonged gesture execution resulted from speech perturbation.

Furthermore, in the gesture-and-speech condition, participants corrected their speech and named the new colour on average 769.10 ms (SD = 114.57 ms) after the gesture stroke had been initiated in the colour-changed trials. Given that the mean G-exec time in the non-colour-changed trials was 1197.33 ms (SD = 82.63 ms), this indicates that gesture execution can be prolonged even when speech disruption occurs after more than half of gesture execution has elapsed, supporting the interactive view that gesture and speech continuously interact during execution.

Relationship Between S-Exec and G-Exec Time Prolongations

To assess whether prolongation of G-exec time correlated with prolongation of S-exec time, a Pearson correlation analysis was conducted between S-exec time differences and G-exec time differences between the colour-changed and non-colour-changed trials in the gesture-and-speech condition. G-exec time differences were not significantly correlated with S-exec time differences (r[15] = −.10, p = .690; see Figure 5). This is likely because, in 96.35% of the colour-changed trials in the gesture-and-speech condition, participants first named the target shape with its original colour and then corrected their speech by naming it again with its new colour (e.g., “yellow triangle blue triangle”). In these trials, the prolongation of speech execution largely reflects the added words and may not accurately reflect how much the original speech execution was affected.

Figure 5.

Scatter plot showing the relationship between G-exec time differences and S-exec time differences in the gesture-and-speech condition in Experiment 4.

Experiment 4 extends Experiment 3 by showing that disrupting speech through an unexpected colour change can prolong gesture execution. This effect was observed only in the gesture-and-speech condition and not in the gesture-only condition, ruling out the possibility that gesture execution was independently affected by the unexpected visual input introduced by the colour change. Notably, speech correction occurred on average more than 700 ms after gesture onset, yet gesture execution was still prolonged. Along with Experiment 3, these results provide converging evidence that disrupting speech can lead to prolongation of gesture execution, consistent with the interactive view.

General Discussion

The present study examined whether gesture and speech can interact after execution has begun. The ballistic view (Levelt et al., 1985) holds that interaction is restricted to pre-execution stages, such that once either modality has been initiated, the two systems proceed independently. In contrast, the interactive view (Chu & Hagoort, 2014) proposes continued information exchange during execution. Across four experiments, perturbing either gesture or speech during the production of iconic depictions consistently prolonged execution in the other modality. These converging results indicate that gesture and speech remain dynamically coupled during execution, challenging the ballistic account.

Effect of Gesture Perturbation on Speech Production

Replicating and extending findings from Chu and Hagoort (2014), Experiments 1 and 2 showed that perturbing gesture production consistently prolonged speech execution, with greater gesture disruption leading to greater speech prolongation. These converging patterns indicate that speech timing remains sensitive to ongoing changes in gesture production after speech has begun.

Importantly, this effect was observed across different perturbation methods. In Experiment 1, gestures were disrupted by delayed visual feedback, affecting online motor control from the early stage of the movement. In Experiment 2, gestures were perturbed indirectly by enlarging the target shape after speech onset, which required adjusting an already ongoing gesture. Despite these differences, both manipulations produced consistent effects on speech execution, suggesting that gesture perturbation influences speech execution regardless of the type or timing of the perturbation.

Crucially, in Experiment 2, speech prolongation occurred only in the gesture-and-speech condition, and not in the speech-only condition, ruling out nonspecific effects of visual change. Notably, gesture enlargement occurred after more than half of the spoken response had been produced, yet speech execution was still prolonged. Together, these findings challenge the ballistic account and support the interactive view that gesture and speech remain dynamically coupled throughout execution.

Effect of Speech Perturbation on Gesture Production

Experiments 3 and 4 tested the complementary prediction that perturbing speech affects gesture execution. Chu and Hagoort (2014) have shown that gesture execution can be prolonged when speech is disrupted at the speech planning stage. The present study extends these findings by demonstrating that speech perturbations can also influence gesture execution even when both modalities are already underway. Across two distinct manipulations, delayed auditory feedback (Experiment 3) and stimulus colour change (Experiment 4), speech disruption consistently prolonged gesture execution. In Experiment 3, the magnitude of gesture prolongation scaled with the magnitude of speech prolongation, indicating a systematic coupling between the two modalities during execution. In Experiment 4, gesture execution was prolonged even when speech perturbation occurred well after gesture initiation. These findings provide strong support for the interactive view by showing that gesture execution remains sensitive to speech disruptions during the execution stage.

Although previous studies on natural communication have demonstrated tight temporal coordination between gesture and speech (e.g., Morrel-Samuels & Krauss, 1992; Pouw & Dixon, 2019), such findings could reflect pre-planned synchronization. In contrast, the current experiments directly disrupted speech during ongoing gesture production and demonstrated causal effects on G-exec time. This suggests that gesture-speech coordination cannot be entirely explained by pre-planned synchronization. Instead, it reflects continuous interaction during execution, supporting a model of a bidirectional and dynamically coupled production system.

Modality-Specific Constraints on Execution Under Perturbation

Although the primary aim of the present study was to examine how perturbing gesture or speech affects execution in the other modality, the experiments also revealed systematic differences in how perturbations affected the execution of the perturbed modality itself, depending on whether production was unimodal or bimodal. In Experiment 1, delayed visual feedback prolonged gesture execution more when gestures were produced alone than when accompanied by speech. In contrast, in Experiments 3 and 4, perturbing speech, either by delayed auditory feedback (Experiment 3) or by an unexpected stimulus colour change (Experiment 4), prolonged speech execution more when speech was produced together with a gesture than when produced alone. These contrasting patterns suggest that the presence of the other modality differentially shapes execution dynamics depending on which system is disrupted. Although the exact mechanisms remain to be specified, these findings are difficult to reconcile with the ballistic account of independent execution and instead support the interactive view, in which gesture and speech remain dynamically coupled and mutually constrained throughout execution. This interpretation is consistent with recent evidence that co-speech gestures can modulate the magnitude and stability of articulatory movements, suggesting that gesture-speech coupling operates at the level of motor execution (Garvin et al., 2025).

Implications for Gesture Production Models

Gesture production models differ substantially in their assumptions about when and how gesture and speech production systems exchange information. According to the Sketch Model (De Ruiter, 1998, 2000), gestures originate from imagistic representations in working memory during the conceptualization phase of speech production. In this model, the interaction between gesture and speech occurs only during planning, and once the planning of each modality is completed, the systems act independently during execution. In contrast, the Growth Point Theory (McNeill, 1992, 2000) proposes that gesture and speech emerge from a single core idea (i.e., the growth point), resulting in a tightly integrated system in which the two modalities remain linked throughout production, including execution. Other influential models, such as the Lexical Retrieval Hypothesis (Krauss et al., 1996) and the Interface Model (Kita & Özyürek, 2003), explicitly allow interaction between gesture and speech during planning but do not specify whether or how such interaction extends into the execution phases.

The results of the current study further suggest that interaction during execution may depend on the type and complexity of the gesture involved. Whereas perturbation of pointing gestures in Chu and Hagoort (2014) produced closely matched delays in gesture and speech, perturbation of iconic gestures in the present study resulted in larger adjustments in gesture execution than in speech. This asymmetry may reflect differences in gesture complexity and automaticity: pointing gestures are highly practised and temporally constrained, whereas iconic gestures involve more extended and flexible motor programmes, making them more susceptible to disruptions at the execution level. Future research could test this idea by systematically varying gesture complexity to examine how it shapes the dynamics of gesture-speech interaction during execution. Such findings would further refine the interactive account by specifying how gesture-speech dynamics are affected by the motor and representational properties of different gesture types.

Limitations

The current paradigm required participants to produce highly controlled iconic depictions, which allowed gesture execution to be measured with high temporal precision and enabled perturbations to be applied at defined moments during ongoing movement. This methodological choice necessarily constrained gesture form and increased the degree to which movements were guided by an available visual referent in the environment (e.g., tracing shapes with a tracked fingertip cursor in VR). Such movements may be more constrained than spontaneous co-speech gestures produced in natural conversation. Nevertheless, visually anchored iconic depictions of this kind are not uncommon in everyday communication. Speakers frequently produce tracing or drawing iconic gestures, for example, when tracing the outline of an object on a table, indicating a trajectory on a map, or drawing a shape in the air while referring to a visible referent. The iconic gestures examined in the present study remain representational in nature, and they retain the core functional characteristic of those used in everyday life: the depiction of semantic content through hand movement coordinated with speech. Whether similar interaction dynamics extend to less constrained, imagination-based iconic gestures remains an important question for future research. Furthermore, a recent study of unscripted dyadic conversations showed that coordination extends beyond manual gestures to include head movements, posture, and full-body dynamics, particularly under challenging communicative conditions such as background noise (Hládek & Seeber, 2025). Future studies should therefore examine whether the interaction dynamics observed here extend to more naturalistic communicative contexts involving coordination across these multiple bodily channels.

Another limitation, specific to Experiments 2 and 4, is that stimulus changes, such as shape enlargement and colour change, required replanning of the perturbed modality, raising the possibility that the prolongation of the other modality was a consequence of this replanning rather than of a disruption of execution itself. Although this cannot be fully ruled out, it is important to note that in both experiments, the replanning occurred after execution of the other modality had already begun. This is critical because the ballistic view predicts no interaction once gesture or speech execution has been initiated, whereas the interactive view explicitly allows continued information exchange during execution, including when one modality undergoes online updating. Moreover, in Experiments 1 and 3, execution was perturbed via delayed visual or auditory feedback without requiring replanning, yet execution in the other modality was still prolonged, and the magnitudes of the prolongations in the two modalities were positively correlated. Together, these findings are most consistent with the interactive view, which posits ongoing coordination between gesture and speech during execution.

Conclusion

Across four experiments, the present study shows that gesture and speech remain interactive during execution. Perturbing one modality consistently prolonged the execution of the other modality, even when the perturbation occurred well after the execution of the other modality had already been initiated. These effects were observed across multiple perturbation methods and cannot be explained solely by planning-level coordination. Moreover, greater prolongation of execution in one modality was associated with greater prolongation in the other, indicating continuous, bidirectional interaction. Together, these findings are inconsistent with the ballistic account in which gesture and speech proceed independently once execution begins, and instead provide strong support for the interactive view that gesture and speech continue to exchange information throughout execution. More broadly, these findings converge with recent theoretical accounts that conceptualize gesture-speech coordination as a dynamically coupled, multimodal system grounded in shared sensorimotor and biomechanical processes (e.g., Ambrazaitis & House, 2023; Momsen & Coulson, 2025; Pouw & Fuchs, 2022; Pouw et al., 2020, 2021; Prieto et al., 2025).

Supplemental Material

Supplemental material, sj-zip-1-qjp-10.1177_17470218261450193 for Gesture-Speech Interaction Beyond Planning: Evidence from Perturbations During Iconic Gesture and Speech Execution by Mingyuan Chu, Hio Tong Pang, Ruoqi Zhang and Peter Hagoort in Quarterly Journal of Experimental Psychology

Footnotes

Acknowledgements

Thanks to Albert Russel and Johan Weustink for technical assistance, and Charlotte Poulisse, Nadine De Rue, and Tineke de Haan for data collection.

ORCID iDs

Mingyuan Chu

Hio Tong Pang

Ethical Considerations

This research was approved by the Ethics Board of the Faculty of Social Sciences, Radboud University.

Author Contributions

Mingyuan Chu played a lead role in formal analysis, investigation, supervision, and writing–original draft, and an equal role in conceptualization and methodology. Hio Tong Pang played a supporting role in formal analysis and writing–original draft. Ruoqi Zhang played a supporting role in formal analysis. Peter Hagoort played a supporting role in methodology and writing–original draft, and an equal role in conceptualization.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Ambrazaitis & House (2023). The multimodal nature of prominence: some directions for the study of the relation between gestures and pitch accents. In Proceedings of the 13th international conference of Nordic prosody (pp. 262–273). Sciendo.

Caramazza (1997). How many levels of processing are there in lexical access? Cognitive Neuropsychology, 14, 177–208. https://doi.org/10.1080/026432997381664

Chu & Hagoort (2014). Synchronization of speech and gesture: Evidence for interaction in action. Journal of Experimental Psychology: General, 143(4), 1726–1741. https://doi.org/10.1037/a0036281

Chu & Kita (2008). Spontaneous gestures during mental rotation tasks: Insights into the microdevelopment of the motor strategy. Journal of Experimental Psychology: General, 137, 706–723. https://doi.org/10.1037/a0013157

Chu & Kita (2016). Co-thought and co-speech gestures are generated by the same action generation process. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(2), 257–270. https://doi.org/10.1037/xlm0000168

De Ruiter, J. P. A. (1998). Gesture and speech production [Unpublished PhD dissertation, University of Nijmegen].

De Ruiter, J. P. (2000). The production of gesture and speech. In McNeill (Ed.), Language and gesture (pp. 284–311). Cambridge University Press. https://doi.org/10.1017/CBO9780511620850.018

Dell, G. S., Schwartz, M. F., Martin, Saffran, E. M., & Gagnon, D. A. (1997). Lexical access in aphasic and nonaphasic speakers. Psychological Review, 104, 801–838. https://doi.org/10.1037/0033-295X.104.4.801

Garvin, Spradling, & Franich (2025). Co-speech gestures influence the magnitude and stability of articulatory movements: evidence for coupling-based enhancement. Scientific Reports, 15(1), 157. https://doi.org/10.1038/s41598-024-84097-6

Hládek, Ľ., & Seeber, B. U. (2025). Head, posture, and full-body gestures in dyadic conversations. arXiv Preprint, 2025, arXiv.2512.03636. https://doi.org/10.48550/arXiv.2512.03636

Kendon (1980). Gesticulation and speech: Two aspects of the process of utterance. In M. R. Kay (Ed.), The relation between verbal and nonverbal communication (pp. 207–227). Mouton.

Kendon (2004). Gesture: Visible action as utterance. Cambridge University Press.

Kita & Özyürek (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal? Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48, 16–32. https://doi.org/10.1016/S0749-596X(02)00505-3

Krauss, R. M., Chen, & Chawla (1996). Nonverbal behaviour and nonverbal communication: What do conversational hand gestures tell us? In Zanna (Ed.), Advances in experimental social psychology (pp. 389–450). Academic Press.

Krauss, R. M., Chen, & Gottesman, R. F. (2000). Lexical gestures and lexical access: A process model. In McNeill (Ed.), Language and gesture (pp. 261–283). Cambridge University Press.

Leonard & Cummins (2011). The temporal relation between beat gestures and speech. Language and Cognitive Processes, 26(10), 1457–1471. https://doi.org/10.1080/01690965.2010.500218

Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press.

Levelt, W. J. M., Richardson, & La Heij (1985). Pointing and voicing in deictic expressions. Journal of Memory and Language, 24, 133–164. https://doi.org/10.1016/0749-596X(85)90021-X

Levelt, W. J. M., Roelofs, & Meyer, A. S. (1999). A theory of lexical access in speech production. The Behavioral and Brain Sciences, 22(1), 1–38. https://doi.org/10.1017/S0140525X99001776

Loehr (2007). Aspects of rhythm in gesture and speech. Gesture, 7, 179–214. https://doi.org/10.1075/gest.7.2.04loe

McClave (1998). Pitch and manual gestures. Journal of Psycholinguistic Research, 27(1), 69–89. https://doi.org/10.1023/A:1023274823974

McNeill (1992). Hand and mind. University of Chicago Press.

McNeill (2005). Gesture and thought. University of Chicago Press.

McNeill (2012). How language began: Gesture and speech in human evolution. Cambridge University Press.

McNeill & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In McNeill (Ed.), Language and gesture (pp. 141–161). Cambridge University Press.

Momsen, J. P., & Coulson (2025). Decoding prosodic information from motion capture data: The gravity of co-speech gestures. Open Mind, 9, 652–664. https://doi.org/10.1162/opmi_a_00196

Morrel-Samuels & Krauss, R. M. (1992). Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3), 615–622. https://doi.org/10.1037/0278-7393.18.3.615

Pouw, de Jonge-Hoekstra, Harrison, S. J., Paxton, & Dixon, J. A. (2021). Gesture–speech physics in fluent speech and rhythmic upper limb movements. Annals of the New York Academy of Sciences, 1491(1), 89–105. https://doi.org/10.1111/nyas.14532

Pouw & Dixon, J. A. (2019). Quantifying gesture speech synchrony. In Grimminger (Ed.), Proceedings of the 6th Gesture and Speech in Interaction – GESPIN 6 (pp. 75–80). Universitaetsbibliothek Paderborn.

Pouw & Fuchs (2022). Origins of vocal-entangled gesture. Neuroscience & Biobehavioral Reviews, 141, 104836. https://doi.org/10.1016/j.neubiorev.2022.104836

Pouw, Harrison, S. J., & Dixon, J. A. (2020). Gesture–speech physics: The biomechanical basis for the emergence of gesture–speech synchrony. Journal of Experimental Psychology: General, 149(2), 391–404. https://doi.org/10.1037/xge0000646

Pouw, Harrison, S. J., & Dixon, J. A. (2022). The importance of visual control and biomechanics in the regulation of gesture-speech synchrony for an individual deprived of proprioceptive feedback of body position. Scientific Reports, 12(1), 14775. https://doi.org/10.1038/s41598-022-18300-x

Prieto, Esteve-Gibert, & Shattuck-Hufnagel (2025). Towards a novel conceptualization of prosody that accounts for spoken and visual signals: The modality-neutral prosodic framework hypothesis. Gesture, 23(1/2), 119–159. https://doi.org/10.1075/gest.25012.pri

Rapp & Goldrick (2000). Discreteness and interactivity in spoken word production. Psychological Review, 107(3), 460–499. https://doi.org/10.1037/0033-295X.107.3.460

Rusiewicz, H. L. (2011). Synchronization of prosodic stress and gesture: A dynamic systems perspective. Proceedings of GESPIN, 2011, 109–114.

Rusiewicz, H. L., Shaiman, Iverson, J. M., & Szuminsky (2014). Effects of perturbation and prosody on the coordination of speech and gesture. Speech Communication, 57, 283–300. https://doi.org/10.1016/j.specom.2013.06.004

Tuite (1993). The production of gesture. Semiotica, 93, 83–105. https://doi.org/10.1515/semi.1993.93.1-2.83

Wagner, Malisz, & Kopp (2014). Gesture and speech in interaction: An overview. Speech Communication, 57, 209–232. https://doi.org/10.1016/j.specom.2013.09.008

Welch, R. B., & Warren, D. H. (1986). Intersensory interactions. In J. P. Thomas (Ed.), Handbook of perception and human performance, Vol. 1: Sensory processes and perception (pp. 25.1–25.36). Wiley.

Wittenburg, Brugman, Russel, Klassmann, & Sloetjes (2006). ELAN: A professional framework for multimodality research. In 5th international conference on language resources and evaluation (LREC 2006) (pp. 1556–1559).
