Abstract
Repeating the movements associated with activities such as drawing or sports typically leads to improvements in kinematic behavior: these movements become faster, smoother, and exhibit less variation. Likewise, practice has also been shown to lead to faster and smoother movement trajectories in speech articulation. However, little is known about its effect on articulatory variability. To address this, we investigate the extent to which repetition and predictability influence the articulation of the frequent German word “sie” [zi] (they). We find that articulatory variability is proportional to speaking rate and the duration of [zi], and that overall variability decreases as [zi] is repeated during the experiment. Lower variability is also observed as the conditional probability of [zi] increases, and the greatest reduction in variability occurs during the execution of the vocalic target of [i]. These results indicate that practice can produce observable differences in the articulation of even the most common gestures used in speech.
1 Introduction
1.1 Improvements in kinematic behavior of hand movements
Kinematic behavior is inherently variable, such that repetitions of any single action, for example repeatedly making a particular hand gesture, will invariably exhibit variation in each instance. One source of this variability is noise in the central nervous system, which can affect sensory processes, and the activities associated with motion planning and movement execution (Bays & Wolpert, 2007; Churchland et al., 2006; Harris & Wolpert, 1998; van Beers et al., 2002). A further source of movement variability arises out of uncertainty about the target of a movement, and the degree to which the target location is predictable. The less predictable the target of a movement, for example because it shifts during testing, the greater will be the uncertainty about the target and the larger will be the variability of the movement (Georgopoulos et al., 1981; Pellizzer & Hedges, 2003).
An obvious way in which the uncertainty associated with any given movement can be moderated is through repetition and practice. In everyday life, the idea that the variability associated with a particular movement, for example swinging a golf club or selecting a note on a keyboard, can be reduced through practice is a familiar one. It is therefore not surprising that a considerable amount of laboratory work has been conducted to better understand the way that repetition influences kinematic behavior, particularly in hand movements. These studies have shown that practice of a particular movement serves to optimize the kinematic system, resulting in the execution of movement trajectories that are shorter and smoother, require less effort, and crucially, are less variable (Darling et al., 1988; Georgopoulos et al., 1981; Gribble & Ostry, 1996; Madison et al., 2013; Platz et al., 1998; Raeder et al., 2015; Segalowitz & Segalowitz, 1993; Sosnik et al., 2004; Viviani & Schneider, 1991). One way of explaining these findings is in terms of the way that practice reduces uncertainty along the trajectory of movements. As a movement is repeated, the various intermediate targets and gestures that comprise the movement as a whole become more predictable, resulting in a concomitant decrease in the uncertainty associated with each stage of movement. This raises a question. Playing instruments and swinging golf clubs are occasional pastimes. By contrast, barring a few exceptions, speech is a ubiquitous human behavior. Given that speech is effectively the most over-learned kinematic skill of all, can practice effects be observed in articulation?
1.2 Improvements in kinematic behavior of speech articulation
While there are many similarities between hand movements and articulation, there are of course differences. Beyond the fact that for most people speech will be the most practiced kinematic activity that they engage in, there are also basic structural differences in the underlying kinematic systems. Hand movements are effected with a relatively rigid body joint-angle system and usually have one target (Bourgeois & Hay, 2003). By contrast, the articulatory speech apparatus consists of a rigid bone structure (the jaw) and muscular hydrostats (the tongue, the lips) whose different parts are biomechanically joined and task-dynamically coupled (Bell-Berti & Harris, 1979; Fowler & Saltzman, 1993; Saltzman & Munhall, 1989). In speech, coupled articulators produce successive gestures, which continuously aim for articulatory, sensory and acoustic targets (Browman & Goldstein, 1986; Guenther, 1995; Johnson et al., 1993). The resultant articulatory gestures are subject to tremendous contextual variability because targets can partially compete. This partial competition between targets results in a large overlap between adjacent gestures (Browman & Goldstein, 1986; Perkell & Nelson, 1982). Accordingly, articulatory trajectories for identical phones differ systematically depending on preceding and following phonemes due to coarticulation (Browman & Goldstein, 1986; Magen, 1997; Öhman, 1966).
These differences notwithstanding, there is some evidence to support the idea that speech is affected by practice. For example, the temporal characteristics of speech production have been shown to be influenced by repetition (Fowler, 1988; Fowler & Housum, 1987), which, by definition, will inevitably result in a decrease in uncertainty and a concomitant increase in predictability. Phones and words which are more predictable, either in context or because of their frequency of occurrence, tend to have significantly shorter acoustic durations than less predictable phones and words (Aylett & Turk, 2004; Bell et al., 2009; Cohen Priva, 2015; Gahl, 2008; Ramscar et al., 2014; Tremblay & Tucker, 2011; Whalen, 1991). These effects are mirrored in articulation. Repetitions and frequency are correlated with higher articulatory velocity, smoother gestural transitions and stronger anticipatory coarticulation, which are in turn mirrored by shorter execution times (Tiede et al., 2011; Tomaschek, Arnold, et al., 2018; Tomaschek, Tucker, Baayen, et al., 2018; Tomaschek, Tucker, Ramscar, et al., submitted; Tomaschek, Tucker, Wieling, et al., 2014; Tomaschek, Wieling, et al., 2013). Unsurprisingly, given the foregoing, the temporal coordination of gestural sequences is modulated by lifelong practice (Cychosz, 2020; Green et al., 2002; Noiray et al., 2013; Rubertus & Noiray, 2018, 2020; Tomaschek, Tucker, et al., 2018).
Given that practice does appear to result in faster and smoother kinematic behavior in speech production, the question thus arises whether articulation also exhibits a reduction in variability associated with practice. That is, as articulatory gestures become more predictable, such that the uncertainty associated with them is reduced, will we see the same reduction in variability in speech that has been observed in hand movement studies?
Some initial support for this hypothesis comes from studies showing a reduction of articulatory variability as a result of lifelong, long-term practice. Studies have consistently shown that, as compared to children, adults show smaller temporal variability during anticipatory coarticulation and smaller spatial variability during vowel and consonant production (Belmont, 2011; Goffman et al., 2008; Koenig et al., 2008; Zharkova et al., 2011, 2012). This suggests that, at least during the transition from novice to expert, practice may indeed reduce articulatory variability.
In what follows, we explore whether, even in mature adult speakers, articulatory variability is similarly reduced in linguistic contexts where articulatory gestures are more predictable—and hence, by definition, well-practiced—in contrast to linguistic contexts where they are less predictable—and hence less well-practiced.
1.3 The present study
In order to investigate effects of practice and repetition in hand movements, kinematic studies typically use well defined (quasi-) parabolic pathways that participants are asked to follow with a pen (e.g., Bourgeois & Hay, 2003; Sosnik et al., 2004; Viviani & Terzuolo, 1982). Practically, however, tasks that use precisely predefined trajectories like this are impossible to implement in studies of articulation. However, it is possible to approximate this methodology through the investigation of a specific well-practiced articulatory trajectory. In this case, our trajectory of interest will be a simple vertical articulatory gesture during which the tongue dorsum moves towards a vocalic target. Examples of this simple articulatory gesture can be found in many frequent monosyllabic CV words, in the case of the present experiment, in the German pronoun sie [zi] “Engl. they”. 1 Participants produced the gestures of interest during the articulation of “[zi] + verb” phrases. The kinematics of these gestures were recorded using electromagnetic articulography. This procedure allowed us to examine multiple instances of—in the framework of articulatory phonology (Browman & Goldstein, 1986)—an identical articulatory gesture towards one invariant vocalic target (i.e., the one of /i/) while being produced in a range of contexts. These contexts in turn varied the predictability of the relationship between [zi] and the following verb, allowing the effect of predictability/uncertainty on the variability of the production of our gesture of interest to be observed.
In hand movements, the decrease of variability can occur along the trajectory (Sosnik et al., 2004) or at its end-point (Fitts, 1954). However, given the nature of speech production, articulatory gestures rarely have specifically defined end-points. Instead, articulatory trajectories are better characterized in terms of positions that the articulators consecutively aim for (Browman & Goldstein, 1986; Perkell & Nelson, 1982). Given that this means that articulatory movements tend to be continuous, we thus tested for changes in variability along the entire trajectory of the gestures of interest produced by our participants, including their vocalic targets.
2 Methods
2.1 Stimuli
The gesture of interest of this investigation is produced during the articulation of the German pronoun sie [zi] “they”. To get a measure of variability, we examined this articulation in conjunction with a total of 127 different verbs, such that the gesture of interest was always produced in a “[zi] + verb” phrase (a list of the stimuli can be found in the Appendix in Tables 3 to 6). Verbs were controlled for frequency of use, taken from the SDEWAC corpus (Faaß & Eckart, 2013; Shaoul & Tomaschek, 2013). In all of the verbs, /iː, ɪ, aː, a/ occurred as the nucleus in the first stressed syllable. 2
Participants articulated [zi] 254 times in the course of the experiment. During the course of the study some trials produced no data due to mispronounced articulations or sensors that were not correctly tracked. As a result, the final data set contained on average 184 [zi] instances per participant (min = 107, max = 242). In the analyses that follow, only verbs with a single non-dorsal onset consonant were considered. In total, 3117 tokens were analyzed, consisting of 72 verb types beginning with a coronal consonant, and 55 verb types beginning with a labial consonant.
2.2 Speakers and recording
A total of 21 German native speakers were recorded and paid 15€ for their participation. All speakers provided informed consent before participating in the experiment, and speaker identity was anonymized. Four speakers had to be excluded from the analysis due to a large proportion of missing data or faulty sensors. Of the remaining speakers in the analysis, nine were female and eight were male, with mean age 25.6 years (SD = 3 years).
All recordings were conducted in a sound-attenuated booth in the Department of Linguistics at the University of Tübingen. Speakers were instructed to read the stimuli ([zi] + verb) aloud after they appeared on a computer screen. The list of stimuli was pseudo-randomized for each participant and separated into three recording blocks. Each block was presented once in a slow (odd blocks, inter-stimulus-time: 600 ms; presentation-time: 800 ms) and once in a fast speaking condition (even blocks, inter-stimulus-time: 300 ms; presentation-time: 450 ms). In total, six blocks of stimuli were presented. Presentation order in slow and fast blocks was also pseudo-randomized. Crucially, each stimulus ([zi] + verb) was presented only twice, once in a slow and once in a fast condition.
Participants’ tongue movements during articulation were recorded with an NDI Wave articulograph at a sampling frequency of 100 Hz. Contemporaneous audio recordings of the articulations were also made (22.05 kHz, 16 bit) and synchronized with the recordings from the articulograph. To correct for head movement and to define a local coordinate system, a special 6D reference sensor was attached to the speakers’ forehead.
Before sensors were attached to a participant’s tongue, a bite plate recording was made to determine the head’s rotation in relation to the magnetic emitter. Three sensors in a triangular configuration were attached to the bite plate; these provided a local reference for a standardized coordinate system. During the experiment, tongue movements were captured by three sensors: one slightly behind the tongue tip, one at the tongue middle and one at the tongue dorsum (distance between each sensor: approximately 0.5 cm to 1 cm).
The verbs recorded in the present study contain apical consonants that inhibit potential effects of variability on the tongue tip. Therefore, we investigated only tongue dorsum movements. Since horizontal tongue dorsum movements were minimal (< 1 mm) in the present data, we focused our investigation on movements in the vertical dimension.
2.3 Preprocessing
The effects of head movement on tongue movements were corrected for using an online procedure during recording. The recorded positions of the sensors were centered at the midpoint of the bite plate and rotated. As a result, the back-front direction of the tongue was aligned to the horizontal axis with more positive values towards the front of the mouth, and more positive vertical values towards the top of the oral cavity. Word boundaries were determined by automatically aligning the audio signal with phonetic transcriptions using a Hidden-Markov-Model-based forced aligner for German (Rapp, 1995), and manually verified and corrected where necessary.
3 Analysis
Having described the motivation, methods, and materials for the present study, we next discuss the variables that were used to predict articulatory movements, and lay out the statistical methods that were used in the analysis of these variables.
3.1 Movements across time
It is inherent to speech production that the tongue changes its position as a function of time. It follows that any analysis of articulatory variability must be made in relation to changes in the tongue’s position over time. To do this, we performed a time-course analysis of articulatory movements.
Given that articulatory rates are themselves variable, both within and across speakers (in the present dataset, durations for the gesture of interest ranged between 70 ms and 300 ms), it follows that any analysis of the variability of tongue movements in relation to specific gestures can never be a straightforward process. When trajectories from words with different durations are aligned at their onset, parts of the trajectory, such as the transition between [z] and [i] and the location of the maximum deflection of the tongue dorsum during [i], are located at different time points. This is illustrated in Figure 1 (a) for three different durations of [zi] (color coded), with time depicted on the x-axis and tongue height on the y-axis. Accordingly, articulatory trajectories from words with varying durations cannot be analyzed without mapping these different parts onto one another.

Figure 1. (a–b) Illustrations of the articulatory trajectories of the tongue dorsum taken from [zi] utterances with three different durations (color coded) across original time (a) and across normalized time (b). (c–d) Illustrations of the data for one of the speakers in the slow speaking rate condition depending on the vowel in the following verb. Gray circles represent recorded positions, black lines represent average trajectories. Y-axis represents tongue dorsum heights, x-axis represents normalized times in the word [zi].
In order to achieve a mapping between different parts of the trajectories and to be able to control for duration differences between individual [zi] instances during statistical analysis, time normalization had to be performed. This was achieved by normalizing timestamps for the recorded articulatory positions to a [0, 1] interval, with 0 linked to vowel onset and 1 to vowel offset. This is illustrated in Figure 1 (b). In what follows, we refer to the timestamps for each recorded position normalized for the duration of [zi] as Time.
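The normalization step can be illustrated with a minimal sketch. It assumes that each token comes with raw timestamps plus an onset and an offset time; the function name `normalize_time` and the two example tokens are hypothetical and are not taken from the study's own processing pipeline.

```python
import numpy as np

def normalize_time(timestamps, onset, offset):
    """Map raw timestamps (in seconds) linearly onto [0, 1].

    0 corresponds to the onset and 1 to the offset of the token.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    duration = offset - onset
    if duration <= 0:
        raise ValueError("offset must be later than onset")
    return (timestamps - onset) / duration

# Two hypothetical tokens sampled at 100 Hz, lasting 100 ms and 300 ms.
short_token = np.linspace(0.0, 0.10, 11)
long_token = np.linspace(0.0, 0.30, 31)

short_norm = normalize_time(short_token, onset=0.0, offset=0.10)
long_norm = normalize_time(long_token, onset=0.0, offset=0.30)

# After normalization, the temporal midpoints of both tokens coincide.
print(short_norm[5], long_norm[15])
```

After this mapping, corresponding parts of short and long tokens share the same position on the normalized axis, which is what allows them to be compared within one model.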
Figure 1 (c–d) illustrates how the recorded positions change across time for [zi] articulations preceding verbs with [iː] stem vowels (c) and verbs with [aː] stem vowels (d) across normalized time. The black lines in Figure 1 represent the average change in vertical position of the tongue dorsum across time. The tongue dorsum movement trajectory can be described as s-shaped across time, with the tongue dorsum located at a low position at the onset of [zi] and steadily rising towards the offset of [zi]. As can be seen, there are systematic differences in amplitude between the two upcoming vowels in the verbs: the articulatory trajectory during [zi] has a more pronounced amplitude with a lower onset, and a higher peak and offset when preceding verbs with [a] than when preceding verbs with [i] as stem vowels.
The goal of the present study was to investigate how strongly articulatory trajectories vary across [zi] as a function of predictability and repetition within the experiment. In order to achieve this goal, we controlled for a range of intrinsic and extrinsic effects that are known to modulate average amplitude and shape of articulatory trajectories (represented with the black line in Figure 1). In the next section, we describe the predictors to capture these intrinsic and extrinsic effects as well as the predictors of interest. After this, we then introduce the statistical techniques that are used to analyze the data.
3.2 Predictors of interest
The participants in the present study articulated [zi] multiple times during the course of the experiment. This allowed us to investigate how the repetition of a highly practiced articulatory gesture within a relatively short time span might affect its variability. We operationalized repetition during our experiment by means of WordRepetition (z-scaled). We predicted that movement variability would become smaller due to short-term practice during the experiment. At the same time, this predictor was also used to control for the effects of repetition on the amplitude of the average articulatory trajectory in line with the findings by Tiede et al. (2011).
The contextual uncertainty of the articulatory gesture of [zi] was operationalized by the conditional probability of [zi] given the following verb, in other words, the inverse conditional probability of [zi]. Inverse conditional probability is an explicit measure of expectancy. By contrast, measures such as the verb’s frequency or the phrase’s bigram frequency alone are not (Baayen, 2001; Shannon, 1948). From a learning perspective, inverse conditional probability is also an index of the prediction error associated with the following verb given [zi] (Aizenberg et al., 2000; Daw et al., 2008; Dayan & Daw, 2008; Dechter, 1986; Hannun et al., 2014; Ng & Jordan, 2002; Schultz, 2006; Schultz et al., 1997; Sutton & Barto, 1981). Since prediction error of this kind is commonly used to explain learning at both the behavioral and the neurobiological level, it can be taken as an index of how our participants likely acquired their sensitivity to variance in the context of [zi]. This perspective is supported by studies in which conditional probability has been shown to account for behavior better than less informative measures (Arnon & Snider, 2010; Bannard & Matthews, 2008; Bell et al., 2009; Tremblay & Baayen, 2010). This is why the conditional probability measure was favored over the phrase’s joint probability (i.e., the “[zi] + verb” bigram frequency). We calculated the conditional probability of [zi] given the following verb using Equation 1 (from now on SieProbability).
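As an illustration of this kind of measure, the sketch below uses the standard relative-frequency estimate of a conditional probability, that is, the count of the phrase divided by the count of the verb. This is an assumption made for illustration and is not a reproduction of Equation 1; the function name and all counts are hypothetical.

```python
def conditional_probability(phrase_count, verb_count):
    """Relative-frequency estimate of P([zi] | verb).

    phrase_count: occurrences of the "[zi] + verb" phrase
    verb_count:   occurrences of the verb in any context
    """
    if verb_count <= 0:
        raise ValueError("verb_count must be positive")
    if not 0 <= phrase_count <= verb_count:
        raise ValueError("phrase_count must lie between 0 and verb_count")
    return phrase_count / verb_count

# Hypothetical counts, for illustration only: (phrase_count, verb_count).
counts = {"verb_a": (250, 1000), "verb_b": (5, 1000)}

for verb, (phrase_count, verb_count) in counts.items():
    print(verb, conditional_probability(phrase_count, verb_count))
```

Under this estimate, two verbs with the same overall frequency can still differ widely in how strongly they predict a preceding [zi], which is the distinction between conditional probability and plain frequency drawn above.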
Because the SDEWAC corpus (Faaß & Eckart, 2013) did not contain all the phrases used in the current study, the number of Google search hits was used to obtain
3.3 Control predictors
According to Fitts’ Law (Fitts, 1954), movement variability is proportional to movement time. This effect has repeatedly been found for hand movements (Kim et al., 1999; Schmidt et al., 1979; Sosnik et al., 2004; Viviani & Schneider, 1991). Accordingly, we expected articulatory variability to be proportional to the duration of [zi], which we operationalized as the log-transformed and z-scaled measure SieDurations. Note that SieDurations and Time are independent of each other. Time predicts the time-course of the tongue’s position, SieDurations predicts changes in the shape of the trajectory in relation to the total duration of the movement. Since our stimuli were recorded under two different speaking rate conditions, we also used the factorial predictor SpeakingRateCondition, with the levels fast and slow.
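The transformation applied to the duration measure can be sketched as follows. The example durations are hypothetical values within the 70 ms to 300 ms range reported above, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not specify which estimator was used.

```python
import numpy as np

def log_z_scale(values):
    """Log-transform positive values, then center and scale them."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("all values must be positive for the log transform")
    logged = np.log(values)
    # ddof=1 gives the sample standard deviation (an assumption here).
    return (logged - logged.mean()) / logged.std(ddof=1)

# Hypothetical [zi] durations in milliseconds.
durations_ms = [70, 120, 180, 300]
scaled = log_z_scale(durations_ms)
print(scaled)
```

The log transform reduces the right skew typical of duration data, and z-scaling places the predictor on a scale comparable to the other scaled covariates.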
The current experimental setup also allowed for the investigation of an additional linguistic variable, namely anticipatory coarticulation. This is a consequence of the fact that [zi] was produced in an unstressed position of a large number of “[zi] + verb” phrases, where the verb contained either a high vowel (/iː/ or /ɪ/) or a low vowel (/aː/ or /a/). Since unstressed syllables undergo anticipatory coarticulation of the upcoming syllable (Fowler & Brancazio, 2000; Hoole et al., 1993; Magen, 1997; Öhman, 1966; Recasens, 1984; Sziga, 1992; Tomaschek, Tucker, et al., 2018; Tomaschek et al., 2014; Tomaschek, Wieling, et al., 2013), we expected the tongue to change movement and variability patterns depending on the upcoming vowel.
Pilot analyses indicated that the average articulatory trajectory varied with the vowel category. Since the process of articulation is gradual rather than discrete, a gradient measure was used to account for the effect of the upcoming vowel.
To do so, the measure TargetDistance was obtained by calculating the vertical difference between the tongue position in the center of [i] in [zi] and the tongue position in the center of the verb’s stem vowel. Manual inspection indicated that this predictor exhibited a broad distribution of values, which allowed it to be included into the model. Model comparisons revealed that TargetDistance provided a better fit to the data (a reduction of the ML-score,
Finally, Fitts’ Law (Fitts, 1954) also predicts that movement variability is proportional to a limb’s traveled distance. To account for a possible similar effect in tongue movements, we calculated the total distance that the tongue dorsum has traveled in the Euclidean space during the articulation of [zi] (TraveledDistance).
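A minimal sketch of such a path-length measure is given below. It assumes that the tongue dorsum positions of one token are available as an array of consecutive samples with one column per spatial dimension; the function name and the example path are hypothetical.

```python
import numpy as np

def traveled_distance(positions):
    """Sum of Euclidean distances between consecutive sensor positions.

    positions: array of shape (n_samples, n_dims), e.g. (x, y) in mm.
    """
    positions = np.asarray(positions, dtype=float)
    if positions.ndim != 2 or positions.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two samples")
    steps = np.diff(positions, axis=0)          # displacement per sample
    return float(np.linalg.norm(steps, axis=1).sum())

# Hypothetical path: 5 mm diagonally, then 4 mm straight down.
path = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
print(traveled_distance(path))  # 9.0
```

Note that this measure accumulates the whole path, so a trajectory that rises and falls again yields a larger value than the straight-line distance between its start and end points.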
Given previous findings from studies of articulation, we expected that the average tongue trajectory across Time would be shallower in relation to greater WordRepetition as a result of fatigue effects, and shallower in relation to shorter SieDurations as a result of undershoot. We predicted a similar effect for TargetDistance, where average tongue trajectories were expected to be shallower when TargetDistance was greater, because this would result in increased anticipation of the vowel in the verb. 4 Finally, we expected that standard deviations along the average tongue trajectory would be smaller in relation to greater WordRepetition due to practice during the experiment, smaller in the slow SpeakingRateCondition, smaller with smaller TargetDistance, and smaller in relation to smaller TraveledDistance.
3.4 Modeling strategy
To better understand any variability observed around the articulatory trajectory across [zi], we exploited the fact that standard regression models estimate the mean of the data and also provide the residuals, that is, the difference between the mean and the raw data. In Figure 1, residuals are the vertical differences between the black line and the gray circles. Studies by Chodroff and Wilson (2017), Sonderegger (2015) and Sonderegger et al. (2017) have modeled variability by fitting a standard regression model, extracting the residuals from that model and subjecting them to a second regression analysis.
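The two-stage logic can be sketched with ordinary least squares on simulated data. This is only an illustration of the general idea: the data, the single predictor `x`, and the choice to model the absolute value of the residuals in the second stage are assumptions made here, not details of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: both the mean and the spread of y increase with x.
x = rng.uniform(0.0, 1.0, n)
noise_sd = 0.5 + 1.5 * x
y = 2.0 + 3.0 * x + rng.normal(0.0, noise_sd)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Stage 1: model the mean and extract the residuals.
beta_mean, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_mean

# Stage 2: model the size of the residuals (absolute value, by assumption)
# with the same predictor to see whether variability depends on x.
beta_spread, *_ = np.linalg.lstsq(X, np.abs(residuals), rcond=None)

print("mean model coefficients:  ", beta_mean)
print("spread model coefficients:", beta_spread)
```

A positive slope in the second model indicates that observations scatter more widely around the fitted mean as the predictor increases, which is the kind of pattern a variability analysis is designed to detect.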
We used this approach in a top-down, bottom-up modeling strategy to obtain a final model for the average tongue trajectory and a final model for the residuals. These two models were subsequently merged and fitted simultaneously with a Generalized Additive Mixed-Effects model (GAMM, package
While linear mixed-effects models model linear functional relations between a response and a predictor, GAMMs model non-linear functional relations between a response and a numeric predictor. This is accomplished by means of smooths. In this way, GAMMs allow us to investigate the non-linear time-course of the average articulatory trajectory that can be seen in Figure 1. Further details of the mathematical underpinnings of GAMMs, as well as their use in analyzing time-dependent non-linear data, can be found in Wieling et al. (2016) as well as in Tomaschek, Tucker, et al. (2018), Tomaschek et al. (2014), Tomaschek, Wieling, et al. (2013), Kryuchkova et al. (2012), Nixon et al. (2016) and Tomaschek, Arnold, et al. (2018). Non-linear interactions between Time and potential covariates were modeled with tensor product smooths. Tensor product smooths explain the relation between the dependent variable and a non-linear interaction between two covariates by means of estimating a wiggly surface that best predicts the data.
We controlled for random effects on the average trajectory, using random factor smooths, that is, smooths for random effects that can be thought of as non-linear equivalents to a combination between random intercepts and random slopes in mixed-effects regression. These included random factor smooths across Time per Speaker, controlling for inter-speaker variability during the production of [zi] (see Fuchs et al., 2008; Tomaschek & Leeman, 2018; Weirich & Fuchs, 2006 for spatio-temporal articulatory variation due to physiological differences), and random factor smooths across Time per Verb, controlling for variation in tongue position due to near and far neighboring consonants (Fowler & Brancazio, 2000; Hoole et al., 1993; Magen, 1997; Öhman, 1966; Recasens, 1984; Sziga, 1992; Tomaschek et al., 2014; Tomaschek, Wieling, et al., 2013). Although SieProbability was not significant for tongue height in pilot analyses, we included a random factor smooth for SieProbability per Speaker in order to control for possible by-participant effects as a function of SieProbability.
The final model had
Having discussed the details of our analysis, we turn in the following section to the results. We first present the results of the model terms fitting the average time-course of the tongue dorsum. We describe how the GAMM model should be interpreted, as well as describe how GAMM plots should be read. Following this, we discuss the results of the model terms fitting the standard deviation around the average trajectory.
3.5 Effects of covariates on the average tongue dorsum trajectory
The average tongue trajectory was fitted by means of two three-way interactions. By means of the interaction Time × SieDurations × SpeakingRateCondition, we investigated how the shape of the tongue dorsum’s movement trajectory across time was modulated by differences in [zi] duration in slow and fast SpeakingRateCondition. By means of a Time × TargetDistance × SpeakingRateCondition interaction, we investigated how tongue movements across time varied due to anticipatory coarticulation in the two SpeakingRateConditions. By means of one two-way interaction, Time × WordRepetition, we investigated how tongue movements across time were modulated by WordRepetition. Pilot analyses indicated that SieProbability did not emerge as a significant predictor of the average articulatory trajectory. Also, we found that the average tongue height did not significantly differ between the fast and the slow SpeakingRateCondition (
Table 1 shows the summary of the effects for the average trajectory in the final model. Part (A) reports the intercept (in mm) of the model. Part (B) reports the non-linear effects in the model. Interactions were fitted with combinations of main effects, using smooths “s()”, and partial tensor product smooths “ti()”. This method is equivalent to the
Summary of partial effects in the GAMM model, fitting the
Estimated degrees of freedom (“edf”) of smooths for Time, SieDurations, TargetDistance in both SpeakingRateConditions are larger than 1, indicating that all these effects had a non-linear functional relationship with the position of the tongue dorsum. This was also the case for WordRepetition. Furthermore, the shape of the tongue movement changed across time due to the predictors, as indicated by the significant partial interactions between Time and SieDurations, TargetDistance and WordRepetition. All random factor smooths were significantly non-linear, as can be seen at the bottom of Table 1.
3.6 Estimated average trajectory
To understand the effects of variability, we first need to understand the time-course of the tongue dorsum sensor during [zi] as it was modulated by the interaction between the predictors TargetDistance, SieDurations and WordRepetition.
GAMM interactions are illustrated in a different way to standard regression plots. Estimates of the dependent variable obtained in a linear model are typically depicted on the y-axis. To illustrate interactions in a GAMM model, the shape of the estimated surface (in our case, tongue position across time modulated by another predictor) is illustrated by means of surface plots (Figure 2).

In Figure 1, the movement of the tongue was represented as a black line. By contrast, in a surface plot, movement can be represented in a way that is more akin to geographical maps, which often use color to represent the features of a terrain as a function of its geographical coordinates. In the present surface plots, this color coding represents the estimated height of the tongue dorsum as a function of time (depicted on the x-axis) in interaction with another covariate (depicted on the y-axis). Dark blue colors indicate that the tongue dorsum is low at a given time point, as is the case at the onset of [zi]; light yellow areas indicate that the tongue dorsum is high, as is the case at the offset of [zi]. Green represents tongue positions that are in between.
As can be seen in Figure 2, the movement of the tongue evolves from left to right, with the onset of the trajectory linked to the left edge, and its offset linked to the right edge. Modulations of tongue height by covariates are illustrated by changes in the color patterns. As in geographical maps, where areas of the same elevation are represented by one contour line, contour lines in the surface plots connect areas of the same tongue dorsum height on the estimated surface, irrespective of Time and the interacting predictor.
Figure 2 (a–b) illustrates how TargetDistance (depicted on the y-axis) modulated the tongue dorsum positions during the articulation of [zi] in both SpeakingRateConditions. The horizontal black dotted lines represent the average tongue height positions in the verbal stem vowel. When the vocalic targets in the stem of the following verb were higher, the gesture made by the tongue dorsum at the onset of [zi] was articulated at a lower position than when the vocalic targets in the stem were lower. This is illustrated by deeper blue colors at the onset, as TargetDistance increases. The deflection at the vocalic target (roughly at a time point of 0.8) was shallower when TargetDistance was reduced. Öhman (1966) observed a similar pattern of “anti-anticipation.” In his study, lowering of F1 frequencies of [y] became more pronounced preceding [a] than when preceding [y] (see also Magen (1997) for a similar finding in one of the speakers). One possible explanation for this finding might be that anticipation of the vowels in the verb is modulating the tongue’s palatal bracing during [zi] articulation. Palatal bracing has not only been observed in [i], but also in [z] (McAuliffe et al., 2001; McLeod et al., 2006; Stone & Lundberg, 1996). Studies of anticipatory coarticulation have also shown increased palate bracing for [i] relative to [a] (Recasens, 1984, 1990). Greater palate bracing results in stronger lowering of the tongue dorsum due to a more pronounced concave shape of the tongue. This suggests that palate bracing during [zi] was greater when the verb contained an [i], than when the verb contained an [a], although this hypothesis has yet to be tested empirically.
With regard to SpeakingRateCondition, the two conditions appear to have modulated the time point of the onset of the transition between [z] and [i]. It appears that this effect depends on WordRepetition, with the transition beginning earlier in the fast condition than in the slow condition.
The effect of SieDuration is illustrated in Figure 2 (c–d). The amplitude of the trajectory became larger and more distinct in longer words than in shorter words. Lower tongue dorsum positions were observed in the onset of [z], while higher tongue dorsum positions were seen in the [i]. Further, the tongue dorsum was articulated at lower positions in the fast than in the slow SpeakingRateCondition. This potentially reflects an effect of reduction due to a shorter articulation window during the experiment.
Figure 2 (e) illustrates the effect of WordRepetition. As can be seen, tongue dorsum positions across the entire [zi] articulation become lower—that is, more centralized—as the experiment progresses. This is likely to reflect a standard finding, namely fatigue due to repetition.
3.7 The variability around the average trajectory
Having discussed the control effects of speaking rate, word duration, anticipatory coarticulation, and repetition during the experiment on the average trajectory, we now turn our attention to our main hypothesis: examining whether the effects of practice are detectable in articulatory variability. Recall that we assessed the variability by means of the standard deviation term in the gaulss GAMM, which represents the absolute deviation from the average trajectory.
The standard deviation in the model was fitted with a main effect for SpeakingRateCondition, and individual smooths for WordRepetition, TraveledDistance and VowelProportions. We included a main tensor product “te()” for a Time×SieProbability interaction, fitting the main effects as well as the interaction simultaneously, and a partial tensor product “ti()” for a Time×TargetDistance interaction.
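As a schematic summary, the structure of the standard deviation model described above can be written as a Gaussian location-scale model. The notation here is introduced only for illustration: $y_i$ is tongue dorsum height, $\mu_i$ the average trajectory, $\mathrm{Fast}_i$ an indicator for the fast SpeakingRateCondition, and $f_k$ smooth functions. The link $\log(\sigma_i - b)$ with a small lower bound $b$ is the one documented for the gaulss family in mgcv; that this link applies here is an assumption, since the text does not state it.

```latex
\begin{align}
y_i &\sim \mathcal{N}\!\left(\mu_i,\ \sigma_i^2\right) \\
\log(\sigma_i - b) &= \beta_0 + \beta_1\,\mathrm{Fast}_i
  + f_1(\mathrm{WordRepetition}_i)
  + f_2(\mathrm{TraveledDistance}_i)
  + f_3(\mathrm{VowelProportions}_i) \\
  &\quad + f_{\mathrm{te}}(\mathrm{Time}_i,\ \mathrm{SieProbability}_i)
  + f_{\mathrm{ti}}(\mathrm{Time}_i,\ \mathrm{TargetDistance}_i)
\end{align}
```

In this sketch, $f_{\mathrm{te}}$ carries both main effects and their interaction, whereas $f_{\mathrm{ti}}$ carries only the interaction component, which is why a separate main effect for TargetDistance could be included or omitted independently.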
The main effect for TargetDistance is missing because pilot analyses showed that it was not significantly predictive of standard deviation. Including SieProbability and the Time×SieProbability interaction in the standard deviation model improved model fit, as supported by a decrease in AIC (∆AIC = 156.1 and 13.8, respectively). To keep run time manageable, this was tested with a simpler model containing only a smooth for Time×SpeakingRateCondition for the average tongue dorsum trajectory and random factor smooths across Time by Participants. It is noteworthy that the effect of SieProbability was also present without the linguistic control predictors in both the term fitting the average trajectory and the term fitting the standard deviations. We found that standard deviation around the average trajectory was significantly larger in the fast than in the slow SpeakingRateCondition (Part (A) of Table 2). This finding is in line with Fitts’ Law that movement variability is proportional to movement velocity (Fitts, 1954).
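The AIC-based model comparison mentioned above follows a simple recipe, sketched below in Python. This is an illustration only: the log-likelihoods and parameter counts are invented values, the function names are hypothetical, and in a GAMM the parameter count would be the effective degrees of freedom rather than a raw count.

```python
def aic(log_likelihood: float, n_params: float) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def delta_aic(aic_without: float, aic_with: float) -> float:
    """Positive values mean the model WITH the extra term fits better."""
    return aic_without - aic_with


# Hypothetical numbers, not taken from the study.
baseline = aic(log_likelihood=-1200.0, n_params=10.0)   # model without the term
extended = aic(log_likelihood=-1180.0, n_params=15.0)   # model with the term
improvement = delta_aic(baseline, extended)
print(improvement)
```

A positive difference indicates that the gain in likelihood outweighs the penalty for the additional parameters.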
Summary of partial effects in the GAMM model, fitting the
The non-linear effects of the standard deviation terms are shown in part (B) of Table 2. We found that standard deviation had a significant functional non-linear relation with TraveledDistance and WordRepetition. No significant interactions with time were found for the present data. This means that TraveledDistance and WordRepetition do correlate with variability during the entire movement trajectory, rather than only parts of it. The significant Time × SieProbability interaction indicates that standard deviation around the average trajectory changed along the movement’s time-course in relation to SieProbability. The significant partial tensor smooth Time × TargetDistance showed that this was also the case for TargetDistance. We also included random intercepts for Speaker and Verb. Figure 3 illustrates the estimated effects of standard deviation.

Partial effects for the
Standard deviations decreased for the first quantile of TraveledDistance, but increased steadily in the second to fourth quantile (Figure 3 (a)). This last effect mirrors the predictions of Fitts’ Law, namely that variability increases with the distance between onset and target of the movement (Fitts, 1954). Note that TraveledDistance significantly correlated with WordDuration (Spearman’s rank correlation ρ = 0.51) and pilot analyses indicated that WordDuration had a similar effect on standard deviation.
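For readers unfamiliar with the rank correlation reported above, the following minimal Python sketch shows how Spearman's ρ can be computed as the Pearson correlation of average ranks. The data values are invented for illustration and the function names are hypothetical; this is not the authors' analysis code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values (distance in arbitrary units, duration in seconds).
traveled_distance = [2.0, 3.5, 1.0, 4.2, 2.8]
word_duration = [0.21, 0.25, 0.20, 0.24, 0.30]
rho = spearman_rho(traveled_distance, word_duration)
```

Because only ranks enter the computation, the coefficient captures any monotonic relation between the two measures, not just a linear one.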
Figure 3 (b) illustrates the partial effect of WordRepetition. The vertical gray dotted lines represent the average boundaries between the recording blocks. As can be seen, standard deviation decreases in the first two recording blocks, and then increases minimally in blocks three and four, before decreasing again in the last two blocks. Overall, this pattern indicates a reduction of variability across the experiment. This is consistent with our hypothesis that the repetition of one and the same tongue gesture would result in the same decrease in variance that has been observed in other kinematic studies (e.g., Sosnik et al., 2004). That is, consistent with what has been found for speed and smoothness (Tiede et al., 2011), short-term practice also significantly reduces articulatory variability. Given that we observed only a significant main effect for WordRepetition, these findings also seem to indicate that this decrease in articulatory variability as a result of practice occurs along the entire [zi] trajectory.
The effect on standard deviations of the interaction between Time and SieProbability is illustrated by means of a surface plot in Figure 3 (c). Changes in standard deviation are also color coded: dark blue colors represent a decrease of standard deviations, bright yellow colors represent an increase of standard deviations. It is possible that the exact position of the tongue dorsum during the production of [z] is less relevant than during the production of [i]. Accordingly, articulatory variability was greater at the onset of [zi], that is, during the [z] part, than in the center when the dorsum aimed for the vocalic center in [zi] (around 0.8 normalized time).
The observed reduction in variability was modulated by SieProbability as follows: In the onset of [zi], only a marginal effect of SieProbability can be observed. Towards the offset, as SieProbability increased, the variability around the average trajectory became significantly lower (mirrored by deeper shades of blue). The minimum of standard deviation across time coincides with the maximum deflection of the tongue dorsum movement, as observed in Figure 2. In other words: When the probability of [zi] was high, articulatory variability during the [i] target location was low and vice versa.
Finally, we turn our attention to the partial effect of TargetDistance (Figure 3 (d)). When [zi] was followed by verbs with high vowels, standard deviations decreased only slightly towards the onset of [zi] and increased only slightly towards the offset of [zi]. The effect was reversed when [zi] was followed by verbs with low vowels: standard deviations increased towards the onset of [zi] and decreased towards the offset of [zi]. One possible explanation of this difference in articulatory variability might be how the shape of the gesture’s trajectory depends on the stem vowel in the upcoming verb. When the tongue body moves from [z] to [i] to a low vowel in the verb’s stem, it describes an arch, whose peak is located at the time point of [zi]’s vocalic target (Iskarous, 2005). This is not the case with high vowels in the verb’s stem. Instead, the tongue body describes a plateau when moving towards high vowels (see Figure 2 (a–b)). The downward movement towards low vowels potentially constrains the tongue’s movement path, which results in a reduction of variability at the offset of [zi]. By contrast, the plateau-like movement towards high vowels is less constrained, allowing for more variability.
4 Discussion
Practicing hand movements, that is, repeating a particular movement, makes the targets of the movement trajectory more predictable. Studies on movement kinematics have shown that this leads in turn to a reduction in the variability around the movement trajectory and in the target area (Georgopoulos et al., 1981; Pellizzer & Hedges, 2003; Sosnik et al., 2004).
Following from hand movement kinematics, we hypothesized that a corresponding effect of practice would be observed in articulatory variability. We operationalized practice both in terms of the repetition of gestures during an experiment, and their predictability based on the frequency with which speakers use articulatory gestures in speech. We tested our predictions by measuring movements of the tongue dorsum during the articulation of the German word sie [zi] “Engl. they” in multiple “[zi] + verb” phrases. Given that these predictions were supported by our results, we now turn to their theoretical implications.
The pronoun [zi] is one of the 50 most frequent German words (Arnold & Tomaschek, 2016; Faaß & Eckart, 2013; Shaoul & Tomaschek, 2013). It is thus clearly a highly practiced articulatory gesture for any native speaker of German. Therefore, one might reasonably ask whether the gestures associated with this word should even be susceptible to observable effects of predictability and repetition. There are a number of reasons why this might actually be the case.
Articulation has been shown to be influenced both by the words and gestures that a speaker previously articulated, and by the upcoming words and gestures that speakers will articulate. This is evidenced by findings on coarticulation (Katsika et al., 2015; Magen, 1997; Öhman, 1966; Whalen, 1990; Whalen et al., 2015) and systematic changes in the acoustic characteristics of words in relation to relative informativity of preceding words and the uncertainty associated with following words (Aylett & Turk, 2004, 2006; Bell et al., 2009; Tremblay & Tucker, 2011).
Learning offers a natural and parsimonious explanation of the effects of practice. In behavioral, neural, and even engineering models (Daw et al., 2008; Ng & Jordan, 2002; Rescorla & Wagner, 1972; Schultz, 2006; Schultz et al., 1997; Sutton & Barto, 1981), learning is normally formalized in terms of the way that experiences serve to influence the uncertainties associated with observable events (see discussions in Ramscar et al., 2010, 2013). These models treat learning as a process that is very sensitive to the success or failure of expectations, that is, whether or not the outcomes predicted by cues in earlier events actually occur.
Along with linguistic regularities that have been learned and encoded at higher levels of abstraction, from this perspective, words can be seen as acoustic cues which are informative to the speaker and the listener about the words that will be articulated after them. In addition to a word’s acoustics, the speaker can use its articulatory gestures as an additional set of cues informative about following words. As speakers conceptualize what they are going to say and plan the words that will be articulated, their abstract acoustic and gestural targets may also serve as cues for ongoing articulations. From this perspective, in our experiment, verbs serve as a source of uncertainty for [zi].
Since the frequencies of gestures, words, and other linguistic regularities vary systematically in natural speech (Linke & Ramscar, 2020; Zipf, 1935), it follows that the amount of information they provide will vary systematically as well. This kind of systematic variation can be seen in the identities and probabilities of the words that occur before the verb. Not only does German syntax allow word orders in which pronouns can occur in front of or after verbs, which goes along with high variability in the identity of words preceding the verb; even in the “pronoun + verb” order, the verb forms used in the present experiment can also be preceded by the pronoun [viɐ] “Engl. we”. It thus follows from this perspective that the uncertainty that occurs at the pronoun will vary considerably.
Thus, although [zi] is a highly frequent (i.e., probable) word—as are its gestural targets—when considered in isolation, in context its predictability will be subjected to a considerable amount of systematic variation. This variation will in turn affect the degree of uncertainty that is associated with the articulatory gestures that will have to be made in order to articulate it. This uncertainty will in turn be proportional to the amount of practice that a speaker gets with making any given gesture in context.
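To make the notion of contextual predictability concrete, the sketch below estimates the conditional probability of a pronoun given the following verb from bigram counts. Everything in it is hypothetical: the counts, the verbs, and the function name are chosen for illustration, and it is only an assumption that a measure such as SieProbability could be estimated along these lines.

```python
from collections import Counter


def conditional_probability(bigram_counts, pronoun, verb):
    """Estimate P(pronoun | verb) as count(pronoun, verb) / count(any word, verb)."""
    total = sum(count for (_, v), count in bigram_counts.items() if v == verb)
    if total == 0:
        raise ValueError(f"verb {verb!r} does not occur in the counts")
    return bigram_counts.get((pronoun, verb), 0) / total


# Invented bigram counts of (preceding word, verb); not corpus data.
counts = Counter({
    ("sie", "gehen"): 30,
    ("wir", "gehen"): 10,
    ("sie", "liegen"): 5,
    ("wir", "liegen"): 15,
})

p_high = conditional_probability(counts, "sie", "gehen")   # sie is likely here
p_low = conditional_probability(counts, "sie", "liegen")   # sie is unlikely here
```

The same word thus receives very different probabilities depending on the verb it precedes, even though its overall frequency is unchanged.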
This then raises a question: what kind of mechanism might explain the relationship between practice and the precision with which gestures are articulated? According to articulatory phonology (Browman & Goldstein, 1986, 1989, 1992), articulatory gestures are defined by invariant tract variables which specify an articulator’s degree of constriction, and its onset and offset timing relative to prosodic cycles. Contextual variation arises due to the phasing and blending between consecutive articulatory gestures. Effects of predictability and uncertainty such as those reported in the present study are not expected. In order to incorporate the current results into this framework, context sensitive constriction information, and potentially context sensitive timing information should be available in the lexicon.
In his DIVA model, Guenther (1995) proposed that articulators aim for sensory target areas. The size of target areas is proportional to speaking rate. In slow speech, target areas are smaller than in fast speech, resulting in smaller articulatory variability. A similar kind of mechanism could be responsible for the reduction of articulatory variability associated with practice and lower uncertainties about articulatory gestures.
Accordingly, it follows that from a learning perspective, when we observe the systematic patterns of articulatory variance even for highly practiced words like [zi], rather than being faced with accounting for how speakers acquire the knowledge that produces this variance, we must rather try to explain why it would be surprising if these systematically variable patterns of practice did not result in correspondingly systematic patterns in behavior (see e.g., Ramscar et al., 2014, 2017).
From the perspective we have so far described, learning is a discriminative process (Ng & Jordan, 2002) which serves to reduce the uncertainty associated with observations of and behavior in the world. The reduction of uncertainty emerges through dynamic interactions between learning, that is, the reinforcement of the relationships between cues and associated outcomes, and the unlearning of these relationships as the result of prediction error, that is, when the predicted outcome did not occur (Ramscar et al., 2010, 2013).
With regard to the current study, we can conceptualize the effects of learning in this way along the following lines: Whenever speakers articulate a “[zi] + verb” phrase in which the verb is highly likely to be preceded by [zi], the articulatory gestures of [zi] will be strongly reinforced in relation to the verb. Simultaneously, these particular articulatory gestures will be unlearned for all other contexts, especially in relation to verbs which are less likely to be preceded by [zi] (see Ramscar et al., 2013, for detailed description of this process). Whenever speakers articulate a “[zi] + verb” phrase in which the verb is very unlikely to be preceded by [zi], the reinforcement of the [zi] gestures will be relatively weak, as will be the unlearning for all the contexts in which [zi] is highly likely to occur. Critically, we are assuming that our participants’ prior practice of the gestures they use to form [zi] did not occur in a vacuum, but rather that this practice occurred in and depended on its context. It also follows that this reasoning can be applied to similar contextual effects on the articulation of other words.
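The interplay of reinforcement and unlearning described above can be illustrated with a minimal simulation of the Rescorla-Wagner update rule (Rescorla & Wagner, 1972). This is a sketch under strong simplifying assumptions, not a model fitted to the present data: the planned verb is treated as the only cue, the pronoun gesture as the outcome, and the labels `verb_high`, `verb_low`, the trial proportions, and the learning rate are all invented.

```python
def rescorla_wagner_update(weights, cues, outcomes_present, all_outcomes, rate=0.1):
    """Apply one Rescorla-Wagner learning step in place and return the weights.

    For each outcome, the prediction is the summed weight of all present cues.
    Weights move towards 1 when the outcome occurs (reinforcement) and towards
    0 when it was predicted but did not occur (unlearning from prediction error).
    """
    for outcome in all_outcomes:
        target = 1.0 if outcome in outcomes_present else 0.0
        prediction = sum(weights.get((cue, outcome), 0.0) for cue in cues)
        error = target - prediction
        for cue in cues:
            weights[(cue, outcome)] = weights.get((cue, outcome), 0.0) + rate * error
    return weights


def simulate(trials, all_outcomes, rate=0.1):
    """Run the update over a sequence of (cues, outcomes_present) trials."""
    weights = {}
    for cues, outcomes_present in trials:
        rescorla_wagner_update(weights, cues, outcomes_present, all_outcomes, rate)
    return weights


# Hypothetical experience: one verb is preceded by [zi] in 9 of 10 cases,
# the other in only 2 of 10 cases; "wir" stands in for the competing pronoun.
cycle = (
    [(("verb_high",), ("zi",))] * 9 + [(("verb_high",), ("wir",))] * 1
    + [(("verb_low",), ("zi",))] * 2 + [(("verb_low",), ("wir",))] * 8
)
weights = simulate(cycle * 50, all_outcomes=("zi", "wir"))
w_high = weights[("verb_high", "zi")]
w_low = weights[("verb_low", "zi")]
```

In this toy setting the association between the frequently co-occurring verb and [zi] ends up much stronger than that for the rarely co-occurring verb, which mirrors the verbal argument: the same gesture is supported to different degrees depending on the context in which it has been practiced.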
Seen like this, the contextual differences we actually observed were not merely unsurprising. They were in some sense inevitable, since these effects are predicted by virtually all behavioral and neuro-biological models of learning (Aizenberg et al., 2000; Daw et al., 2008; Dayan & Daw, 2008; Dechter, 1986; Hannun et al., 2014; Schultz, 2006; Schultz et al., 1997; Sutton & Barto, 1981). Moreover, this logic is also consistent with the results of numerous studies that have shown that speech production is not executed on a phone-by-phone and word-by-word basis but rather that anticipatory effects modulated by uncertainty can be found everywhere. For example, findings in studies of coarticulation (Katsika et al., 2015; Öhman, 1966; Ostry et al., 1996; Sziga, 1992; Tiede et al., 2011; Tomaschek, Tucker, et al., 2018; Whalen, 1990; Whalen et al., 2015), in studies of phonetic duration of single segments (Cohen Priva, 2015; Tomaschek, Plag, et al., 2019) and larger sequences (Bannard & Matthews, 2008; Shaoul et al., 2013; Tremblay & Tucker, 2011) are very likely to be reflecting the same underlying learning processes.
Our data clearly show that variability is proportional to uncertainty. As uncertainty decreases, variability decreases; as uncertainty increases, variability increases. Nevertheless, we should sound a note of caution with regard to the sources of increased variability. The nature of linguistic distributions guarantees that as the frequency of lexical events decreases, levels of individual practice will increasingly vary (Ramscar et al., 2014). To give a trivial example, the word “formant” is a common word in the vocabularies of most readers of this paper, yet it is likely that the majority of the population are entirely unfamiliar with this word. This means that the increased variability we see with lower probability events could reflect the increased variability we would expect across individuals because of the distributional facts, or it might reflect increased variability within individuals as a result of diminished practice (as is the case at the beginning of our experiment). It is very likely that both forces are at work. We leave it to future studies to disentangle their relative contributions to the variability we observe in articulation.
Acknowledgements
We would like to thank the associate editor and two anonymous reviewers for their invaluable comments on previous versions of the manuscript that led to improvements in the exposition. Our thanks also go to our student assistants Mathias Müller from the University of Zürich and Josh Ring from the University of Tübingen for their inexhaustible willpower while correcting all instances of [zi].
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded in part by the Alexander von Humboldt Chair awarded to R. H. Baayen (grant 1141527) and by the Deutsche Forschungsgemeinschaft (Research Unit FOR2373 “Spoken Morphology,” Project “Articulation of morphologically complex words,” BA 3080/3-1).
