Abstract
Speech fluency has been extensively researched as a core construct for second language (L2) speaking assessment. Despite the broad consensus on its multifaceted nature, few researchers have empirically explored the dimensionality of this construct. Operationalizations of fluency vary across research and practice, using both holistic and fine-grained features. To address the dimensionality of speech fluency, in this study we examined an array of fluency features of speaking performances on the Aptis test. We conducted both univariate and multivariate statistical analyses to investigate the relationship between individual fluency features and language proficiency, as well as the relationships among fluency, complexity, and accuracy features. We found differences in the holistic and fine-grained fluency features, suggesting that they might reflect different dimensions of speech fluency and be associated with different components of language proficiency. Based on the findings, we labeled these two types of fluency features as macro and micro fluencies. Whereas macro fluency features tend to entail a holistic representation of fluency, micro fluency features tend to be more closely related to the automatic processing of lexico-grammar, constituting a more direct reflection of the cognitive processes in speech production. The findings support the multidimensionality of speech fluency and the need to include both macro and micro fluency features in the scoring, scale development, and validation of L2 speaking assessment.
The rise of communicative language teaching and testing has expanded researchers’ and practitioners’ focus from accuracy to complexity and fluency (Chambers, 1997; Fulcher, 2000). This movement has brought the discussion of speech fluency to the forefront, making it a core construct in second language (L2) speaking assessment. Historically, speech fluency has been conceptualized in both broad and narrow senses (Lennon, 1990). In the broad sense, fluency is synonymous with overall language proficiency. In the narrow sense, fluency refers to temporal features of speech, which can be further divided into two dimensions: rapidity and smoothness (i.e., lack of disfluency). Skehan (2003) further divided the narrow sense of fluency into speed fluency, breakdown fluency, and repair fluency, where speed fluency refers to the rapidity of speech and the breakdown and repair fluencies refer to the disfluency or smoothness of speech. Segalowitz (2010) provided a slightly different perspective to conceptualize fluency, where he made a distinction among three notions of speech fluency, namely, cognitive fluency, utterance fluency, and perceived fluency. Cognitive fluency is “the efficiency of operation of the underlying processes responsible for the production of utterances,” utterance fluency is “the features of utterances that reflect the speaker’s cognitive fluency,” and perceived fluency is “the inferences listeners make about speakers’ cognitive fluency based on their perceptions of their utterance fluency” (p. 165). Segalowitz’s definition of cognitive fluency is closely associated with automatic processing, an important aspect of second language proficiency (Segalowitz, 2007; Segalowitz & Hulstijn, 2005). Cognitive fluency is difficult to measure as the interpretation of the construct is less straightforward than utterance fluency, but utterance fluency can be objectively measured through an array of temporal variables. 
In language testing literature, fluency investigations tend to relate utterance fluency features to proficiency, often operationalized through test scores awarded by human raters (e.g., Ginther et al., 2010; Iwashita et al., 2008). However, as de Jong (2018) pointed out, be it cognitive fluency or overall proficiency, research mapping utterance fluency features to higher-level constructs of language proficiency is still relatively sparse. Additionally, despite the wide array of utterance fluency features used in language studies, the dimensionality of utterance fluency remains underexplored. In this study, we operationalize language proficiency as test scores awarded by human raters, and we investigate an array of utterance fluency features of speaking performances on the Aptis test, in an effort to explore the following: (1) how these features correlate with language proficiency; (2) whether there are different dimensions underlying these fluency features; and, if so, (3) how these dimensions correlate with overall and various subcomponents of language proficiency.
Operationalization of speech fluency: Holistic and fine-grained features
Despite the broad conceptual consensus, empirical research on speech fluency tends to focus on different subsets of utterance fluency features. In general, there are two strands of fluency investigations in applied linguistics. The first strand examines fluency explicitly within the framework of linguistic complexity, accuracy, and fluency (henceforth CAF), as an effort to elucidate the nature of language proficiency and language development (Skehan, 2001); this line of research tends to employ holistic fluency features, mostly speed features and sometimes pausing features, together with complexity and accuracy features, in order to gauge the overall quality of L2 speaking performance. The origins of CAF-related notions date back to L2 pedagogy research in the 1980s, which made a distinction between fluency and accuracy in understanding the development of oral proficiency (e.g., Brumfit, 1984). The tripartite model of CAF was introduced under task-based language teaching and learning as a methodological framework to investigate the nature of language tasks in relation to the quality of language performances in all three CAF dimensions (see, e.g., the trade-off hypothesis proposed by Skehan, 1998; the cognition hypothesis proposed by Robinson, 2001). However, since the 1990s, the CAF framework has been widely adopted to develop dependent variables to measure L2 performance in SLA research (Housen & Kuiken, 2009). Under this framework, fluency is viewed as a subcomponent of language proficiency complemented by linguistic complexity and accuracy and often used as a key criterion to evaluate the quality of language performance (de Jong, 2018). Until now, this framework has been widely employed to characterize learner performance in both speaking and writing assessment (e.g., Biber et al., 2014; Gebril & Plakans, 2013; de Jong et al., 2013). 
Commonly used fluency features include speech rate (i.e., number of syllables per second), articulation rate (i.e., number of syllables per phonation time), and number of silent pauses (e.g., longer than 250 ms) (Freed, 1995; Lennon, 1990; Riggenbach, 1991). These features tend to be computed by counting the number of syllables or pauses produced during speech, offering a holistic representation of fluency (e.g., how fast a speaker speaks or how many pauses a speaker produces in a speech run) or overall quality of speech produced by L2 speakers; however, the simple computations of features lack fine-grained information about how speech output is generated (e.g., where pauses occur in a speech and how a speaker recovers from disfluencies), information that may relate more directly to the underlying construct of cognitive fluency.
A second strand of fluency research tends to examine fluency in its own right. These studies tend to examine fine-grained fluency features that are mostly associated with pause and repair, with an assumption that these features offer a more direct window into the nature of disfluency. Thus, they can be conceived to be a more direct reflection, as compared to holistic fluency features, of the degree of automatic processing (i.e., the retrieval and processing of lexico-grammar, and the planning and execution of speech) occurring in the mind, and can be operationalized in relation to non-temporal aspects of language performance (de Jong, 2016; de Jong et al., 2013; Kahng, 2014; Segalowitz, 2016). Fine-grained fluency features are believed to reflect the constraints in the speakers’ lexico-grammar knowledge and the speakers’ ability to overcome those constraints. Unlike holistic fluency features, fine-grained fluency features are not computed by counting all syllables or pauses produced indistinguishably; instead, they tend to count specific types of disfluencies (i.e., pause and repair). The identification of these disfluencies is dependent upon the lexico-grammatical features associated with them. Thus, they present fine-grained information about (1) where pauses occur, (2) how they occur (i.e., what happens around pauses), and (3) how they are recovered. Although disfluency is common in L1 speech (Clark & Fox Tree, 2002), it is more prevalent in L2 speech and can occur as a result of the speakers’ low proficiency or grammatical competence (Kormos & Dénes, 2004). Commonly used fine-grained fluency features include pause type (silent vs. filled pauses), pause location (juncture vs. non-juncture pauses), and pause repair (repair strategies and repair success). 
In this paper, we also classify mean length of run (i.e., the mean number of syllables produced between two silent pauses) as a fine-grained fluency feature, because it is not simply a holistic count of syllables produced in speech; instead, the variable incorporates both the number of syllables and the silent pauses and offers more fine-grained information than other holistic speed features regarding the speaker’s ability to produce longer speech runs without overt disfluencies. According to Ginther et al. (2010), longer speech runs often “display greater well-formedness and greater complexity at the phrase level than do shorter runs, suggesting that a speaker’s ability to construct phrases is a critical factor affecting production” (p. 388).
A prominent theoretical model of speech production in L1 is Levelt’s blueprint for the speaker (1989), which postulates that speech production goes through three major stages, namely, conceptualization, formulation, and articulation. Segalowitz (2010) adapted this model to develop seven fluency vulnerability points at different stages of Levelt’s model: microplanning, grammatical encoding, lemma retrieval, morpho-phonological encoding, phonetic encoding, articulation, and self-perception. According to the specifications of this model, the occurrence of disfluency is directly related to lexico-grammar. Kahng (2014) offered some empirical evidence for the validity of Segalowitz’s fluency vulnerability points by comparing fluency features in L1 and L2 speech. She found that although L1 and L2 speech differ in fluency features such as speed and the number of silent pauses, there was also a striking difference in fine-grained fluency features such as non-juncture pauses (similar findings were reported in Skehan & Foster, 2008; Tavakoli, 2011). In the context of speaking assessment, fine-grained fluency features can offer evidence for cognitive validity (Weir, 2005), which investigates the cognitive processes engaged by test takers when responding to test items; the extent to which test takers’ response processes resemble the cognitive processes of real-life language use is regarded as important evidence for the construct validity of a test (AERA, APA, NCME, 2014). The cognitive validity of language tests has been extensively examined in tests of receptive skills, with much effort dedicated to understanding the micro-level skills (e.g., inferencing in reading comprehension) and task response process in these modalities (e.g., see Jang, 2009, and Bax, 2013, for reading assessment; and Field, 2012, for listening assessment). In contrast, relatively fewer efforts have been dedicated to examining cognitive validity in productive-skill tests.
Indeed, research on speaking and writing assessments has focused more on the product than on the process. In writing assessment, recent research has elucidated the construct of writing by delineating the composing process within writing tests (e.g., Plakans et al., 2019). Such work in speaking assessment is scarce. It is arguable that strategic competence offers a window into the cognitive validity of speaking assessments (Youn & Bi, 2019); however, more research is needed to elucidate the speech production process. Apart from strategy, the instantaneous nature of speech makes fine-grained fluency features viable candidates for investigating the composing process of speech.
Dimensionality of speech fluency and its relationship with language proficiency
Despite the wide array of utterance fluency features used in language studies, the dimensionality of utterance fluency remains underexplored. Although the research reviewed above points to meaningful relationships among an array of fluency features and language proficiency, little is known about whether these features represent different dimensions of speech fluency and, if so, whether these dimensions have different relationships with (subcomponents of) language proficiency.
In language testing research, it is commonplace to operationalize proficiency through test scores and relate holistic fluency features with those scores (e.g., Ginther et al., 2010; Iwashita et al., 2008). However, compared to holistic features, fewer studies examined fine-grained fluency features such as the nature of pauses and repair, although these features can evidence interactions among fluency and speakers’ automatic access to grammar and lexis (Clark & Fox Tree, 2002; Corley & Stewart, 2008; Dörnyei & Kormos, 1998). Similarly, little is known about the relationships among various fluency features and other CAF variables in the context of speaking assessment, although the SLA literature has suggested meaningful relationships among them (e.g., Larsen-Freeman, 2009; Robinson, 2001; Skehan, 2009). In contrast, fine-grained fluency features are commonly examined in research in other fields (e.g., linguistics, speech and hearing science, communication), which has suggested that juncture pauses and repair success are closely related to automaticity and overall language proficiency (Freed, 1995; Hieke, 1981; Kahng, 2014; Park, 2016; Riggenbach, 1991; Temple, 1992; Wiese, 1984; although see, for exception, Riazantseva, 2001). That said, it is important to note that previous research shows a mixed picture regarding the relationship between pause location and proficiency level. Although Riazantseva (2001) did not find significant differences in the number of non-juncture pauses produced by intermediate to high proficiency L2 speakers and L1 speakers, other researchers found significant differences between L1 and L2 speakers and across L2 speakers of different proficiency levels (e.g., de Jong, 2016; Kahng, 2014; Skehan & Foster, 2008).
Although empirical research has shown differences in the relationships between individual fluency features and language proficiency, research that compares and contrasts holistic and fine-grained fluency features is rare in the fields of language testing and pedagogy. In addition, few studies have formally explored the dimensionality of speech fluency and CAF in general using factor analysis. Factor analysis has been widely performed with item scores to examine the dimensionality of test constructs. In speaking and writing assessment research, dimensionality has also been investigated among complexity features of speaking or writing performances employing a combination of corpus techniques and exploratory factor analysis (e.g., Biber et al., 2014; LaFlair & Staples, 2017; Yan & Staples, 2020). Following a similar analytic approach, in this study we utilize an array of holistic and fine-grained fluency features that were examined in previous research to explore the relationship between individual features and language proficiency. In addition, we perform an exploratory factor analysis on fluency features, with complexity and accuracy features added, to explore the dimensionality of speech fluency and its relationship with subcomponents of language proficiency. We consider language proficiency as the latent construct underlying all CAF features. To a certain extent, with this study we replicate previous research efforts in examining the relationships between fluency features and language proficiency (de Jong, 2016; de Jong et al., 2013; Kang et al., 2010; Révész et al., 2016; Riazantseva, 2001), including those in language assessment contexts (e.g., Ginther et al., 2010; Iwashita et al., 2008). However, few studies have formally explored the dimensionality of fluency among a wide range of fluency features, and the inclusion of complexity and accuracy features can provide additional insights into the relationship between fluency features and lexico-grammar use.
Specifically, to achieve this goal, we address the following research questions:
RQ1. What are the relationships among individual holistic and fine-grained features and overall language proficiency?
RQ2. Do fluency features show differential relationships with complexity, accuracy, and overall proficiency?
Methods
The Aptis spoken corpus
The spoken corpus we used for this study comprised speech samples drawn from responses to the final task of the Aptis speaking test administered by the British Council. We requested the speech samples from the British Council as part of the 2016 Aptis Assessment Research Grants program. The Aptis speaking test assesses L2 English speakers’ production skills based on their ability to discuss personal experience and opinion on an abstract topic. In this final task, examinees are presented with a picture together with three questions, which are all related to the same topic. Examinees are given one minute for preparation and two minutes to respond. This task is designed to elicit language use that is similar to real-life oral communication as people are constantly asked to share personal experiences and express opinions on different issues. The Aptis test was developed with reference to the Common European Framework of Reference (CEFR) (North et al., 2010). On the Aptis speaking test, test takers receive both a scaled score (0–50) and a corresponding level on the CEFR, which adds meaningful score interpretations for test users (i.e., from the lowest to the highest: A1, A2, B1, B2, and C; https://www.britishcouncil.org/aptis-speaking-video). In this study, the CEFR levels reported by Aptis were used as our scale of proficiency. For this speech corpus, we selected 25 benchmark examinee performances from each of the five CEFR levels listed above. This resulted in a total of 125 speech samples. The duration of the speech samples ranged between 11.86 and 120.37 seconds, with a mean of 78.22 seconds (SD = 32.52). The examinees were adult L2 speakers of English from around the globe, representing a wide range of L1 backgrounds. The five most represented countries were as follows: India (25.6%), Saudi Arabia (10.4%), Colombia (8.8%), Mexico (8.8%), and Ukraine (7.2%).
The five most represented L1 backgrounds were as follows: Malayalam (24.8%), Spanish (20%), Arabic (14.4%), Ukrainian (7.2%), and Mandarin Chinese (4.8%). The distribution of L1 backgrounds in the corpus is representative of the test-taker population on the Aptis test. There were slightly more female examinees (52.8%) than male examinees (47.2%).
Performance features used in the study
All speech files were first converted from .mp3 to .wav format by four undergraduate research assistants who were trained by the third author, and later background noise was reduced using Audacity (Audacity Team, 2018) for better sound quality. The research assistants transcribed the speech samples according to the Philadelphia Neighborhood Corpus (PNC) transcription guidelines (Labov & Rosenfelder, 2011) using ELAN software (Sloetjes & Wittenburg, 2008).
In this study, L2 speakers’ performance was measured in two ways: (1) audio-based measurement of holistic fluency features; and (2) transcript-based measurement of fine-grained fluency features, lexico-grammatical complexity, and accuracy features. A detailed explanation of how performance features were operationalized in this study is presented below.
Audio-based measurement of holistic fluency features was performed automatically on each speech sample by the first author in Praat (Boersma & Weenink, 2015), using a script developed by de Jong and Wempe (2009). This script measures the following features: speech rate (i.e., number of syllables per second), articulation rate (i.e., number of syllables per phonation time), and number of silent pauses longer than 250 ms. The number of silent pauses was then normalized into pauses per second. The first author also automatically extracted mean length of run (i.e., the mean number of syllables between two silent pauses) using the same script; however, as we argued earlier, this variable was treated as a fine-grained fluency feature in this study.
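The holistic measures defined above reduce to simple ratios once syllable counts and pause boundaries are known. The following Python sketch illustrates the arithmetic only; it is not the de Jong and Wempe (2009) Praat script. The input format (a list of hypothetical `(start, end, syllables)` runs separated by silent pauses longer than 250 ms) and the choice to take total time as the span from the first run onset to the last run offset are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (start_s, end_s, syllable_count) for each run,
# where consecutive runs are separated by a silent pause longer than 250 ms.
Run = Tuple[float, float, int]


def holistic_fluency(runs: List[Run]) -> Dict[str, float]:
    """Compute speech rate, articulation rate, pause rate, and mean length of run."""
    if not runs:
        raise ValueError("at least one run is required")
    # Assumption: total time spans first run onset to last run offset.
    total_time = runs[-1][1] - runs[0][0]
    # Phonation time excludes the silent pauses between runs.
    phonation_time = sum(end - start for start, end, _ in runs)
    syllables = sum(count for _, _, count in runs)
    silent_pauses = len(runs) - 1
    return {
        "speech_rate": syllables / total_time,
        "articulation_rate": syllables / phonation_time,
        "pauses_per_second": silent_pauses / total_time,
        "mean_length_of_run": syllables / len(runs),
    }


# Two invented runs with one silent pause between them.
features = holistic_fluency([(0.0, 2.0, 6), (3.0, 5.0, 4)])
print(features)
```

Because articulation rate divides by phonation time rather than total time, it is never lower than speech rate for the same sample.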
With regard to transcript-based measurement of fine-grained fluency features, the first and the second authors manually coded the following features based on the transcriptions done by the research assistants: pause type (i.e., silent and filled pauses), pause position (i.e., juncture and non-juncture pauses), and pause repair (i.e., repair strategy and success). The authors coded the pause type on a Praat TextGrid (Boersma & Weenink, 2015). Silent pauses were automatically detected by a Praat script (Lennes, 2002), and filled pauses were identified as filler words such as um, uh, and hmm. Pause position and repair were manually coded directly on the transcripts. The example below demonstrates how the coding was performed. The first two columns indicate the time stamps of the start and the end of each run; runs are separated by silent pauses longer than 250 ms. Juncture pauses, coded as “//” below, were identified when there were silent pauses at semantic and/or syntactic boundaries. Non-juncture pauses, coded as “#” below, were identified as silent pauses in other locations. Non-juncture pauses that were successfully repaired in the subsequent runs were coded as “*”. We defined successful repair as any attempt to recover from a disfluency that enables the speaker to complete the meaning of the original sentence before the disfluency occurred. Unsuccessful repair refers to repair attempts that failed to enable the speaker to do so; instead, the speaker either produces an incomplete sentence after one or several disfluencies, or abandons the meaning of the original sentence completely and starts a new one. Although we did not consider silent pauses shorter than 250 ms in our analyses, if these short pauses were noticeable, we also marked them with “,”.
Once the manual coding was finished, we normalized the fine-grained fluency features and transformed them into two variables for statistical analysis: (1) proportion of juncture pauses (juncture pause rate hereafter, i.e., number of juncture pauses / number of silent pauses), and (2) success rate in repairing non-juncture pauses (repair success rate hereafter, i.e., number of successful pause repairs / number of non-juncture pauses).
1.002 2.193 One time I was #*
2.823 3.513 dreaming //
3.833 5.732 I was searching for a bank //
6.343 6.823 and //
7.873 12.612 I couldn’t find one and I went deeper and deeper to the city, into the city //
13.372 15.012 and then I realized I got lost //
15.762 16.123 and //
16.482 17.812 didn’t know where it was //
18.853 19.553 I um I #*
20.643 24.832 um put on my phone, and searched Google maps, but I had no #*
25.202 25.492 um
26.092 26.932 connection //
27.853 28.492 and I was #
29.862 30.562 it wasn’t #
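The two normalized variables can be derived mechanically from the markers in a coded excerpt such as the one above. The Python sketch below is an illustration rather than the authors' procedure. It assumes that each coded silent pause carries exactly one run-final marker ("//", "#*", or "#"), and that runs without a marker are not counted as coded pauses. The function name `pause_rates` is hypothetical.

```python
from typing import List, Optional, Tuple


def pause_rates(runs: List[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (juncture pause rate, repair success rate) from coded runs.

    Assumed run-final markers: "//" = juncture pause,
    "#*" = successfully repaired non-juncture pause,
    "#" = unrepaired non-juncture pause.
    """
    juncture = repaired = unrepaired = 0
    for run in runs:
        text = run.rstrip()
        if text.endswith("//"):
            juncture += 1
        elif text.endswith("#*"):
            repaired += 1
        elif text.endswith("#"):
            unrepaired += 1
    non_juncture = repaired + unrepaired
    silent = juncture + non_juncture
    # A rate is undefined (None) when its denominator is zero.
    juncture_rate = juncture / silent if silent else None
    repair_rate = repaired / non_juncture if non_juncture else None
    return juncture_rate, repair_rate


# Run texts from the coded excerpt, without the time stamps.
coded_runs = [
    "One time I was #*",
    "dreaming //",
    "I was searching for a bank //",
    "and //",
    "I couldn’t find one and I went deeper and deeper to the city, into the city //",
    "and then I realized I got lost //",
    "and //",
    "didn’t know where it was //",
    "I um I #*",
    "um put on my phone, and searched Google maps, but I had no #*",
    "um",
    "connection //",
    "and I was #",
    "it wasn’t #",
]
juncture_rate, repair_rate = pause_rates(coded_runs)
print(juncture_rate, repair_rate)
```

Under these assumptions, the excerpt contains 8 juncture pauses and 5 non-juncture pauses, 3 of which were repaired, giving a juncture pause rate of 8/13 and a repair success rate of 3/5.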
We operationalized syntactic complexity as clausal subordination, following the findings from previous corpus-based research (Biber, 1988; Biber et al., 2011). Specifically, we used the number of subordinations per C-unit. We evaluated lexical complexity using Coh-Metrix (McNamara et al., 2014). We did not use type–token ratio (TTR; Templin, 1957) to measure lexical diversity, because the speech samples varied widely in length and TTR is sensitive to length (McCarthy & Jarvis, 2010). Instead, we used the measure of textual lexical diversity (MTLD), a modification of TTR that adjusts for text length, to represent lexical diversity.
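To show why MTLD is less sensitive to length than raw TTR, the Python sketch below follows the general procedure described by McCarthy and Jarvis (2010). It counts how many times the running TTR of a segment falls to a threshold, adds a partial factor for the leftover segment, divides the token count by the factor count, and averages a forward and a backward pass. This is not the Coh-Metrix implementation. The 0.72 threshold is the conventional default, and using "at or below" rather than "below" the threshold is an implementation assumption.

```python
from typing import List, Optional


def _mtld_pass(tokens: List[str], threshold: float) -> Optional[float]:
    """One directional MTLD pass; None when no factor can be computed."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        # A full factor is complete once the running TTR reaches the threshold.
        if len(types) / count <= threshold:
            factors += 1.0
            types = set()
            count = 0
    if count:
        # Partial factor for the leftover segment, proportional to how far
        # its TTR has fallen from 1.0 toward the threshold.
        factors += (1.0 - len(types) / count) / (1.0 - threshold)
    return len(tokens) / factors if factors else None


def mtld(tokens: List[str], threshold: float = 0.72) -> Optional[float]:
    """Average of forward and backward MTLD passes over a token list."""
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    if forward is None or backward is None:
        return None  # e.g., every token is unique, so TTR never drops
    return (forward + backward) / 2


sample = "i was searching for a bank and i could not find a bank".split()
print(mtld(sample))
```

Because the score is the mean number of tokens needed to complete one factor, doubling the length of a sample with a similar repetition pattern roughly doubles both the token count and the factor count, leaving the ratio comparatively stable.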
In terms of lexico-grammatical accuracy, the second and the third authors manually identified errors in the domains of syntax and morphology for grammatical accuracy. We also identified lexical errors (i.e., incorrectly used or imprecise lexical expressions) for lexical appropriateness. To explore grammatical accuracy, the second and the third authors analyzed 15 transcripts (12% of the dataset) randomly selected from each proficiency level, following the coding scheme used in Plakans et al. (2019) (adapted from Bardovi-Harlig & Bofman, 1989; also used by Neumann, 2014). An acceptable agreement was found between the two authors; pairwise correlation between the two authors was .79. After the initial rating, the authors met and discussed errors in each transcript and worked together to refine their understanding of each error type. Once the two authors had reached an agreement on each coding category, the second author rated the rest of the transcripts. For analysis, the second author transformed the frequency of all errors into the number of error-free clauses per C-unit.
Data analysis
We transformed participants’ holistic speaking scores, as expressed in CEFR proficiency levels, into an ordinal, numeric scale (A1 = 1, A2 = 2, B1 = 3, B2 = 4, C = 5) for statistical analyses. We included two phases for the statistical analyses of this study: correlation analysis and multivariate analysis. Both phases of analysis were performed using RStudio, version 1.1.383 (RStudio Team, 2016). To address RQ 1, we examined the descriptive statistics of CAF features for possible trends across the CEFR levels. Spearman rho correlation coefficients were computed among the CAF features and the CEFR levels. We interpreted these correlation coefficients in light of which performance features tended to be associated with differences in speaking performance across the CEFR levels. To answer RQ 2, we incorporated all CAF features in a two-step multivariate analysis to determine whether patterns of co-occurrence among the CAF features can distinguish Aptis speaking performances across the CEFR levels. First, to reduce Type I error in the significance tests, we performed an exploratory factor analysis (EFA) with oblique rotation to reduce the array of performance features to a smaller number of interpretable CAF dimensions. Oblique rotation was used, because we assume that factors underlying CAF features are correlated, rather than orthogonal, to one another. Next, we computed factor scores of the reduced dimensions, which were subjected to MANOVAs and post-hoc univariate ANOVAs in order to examine whether factor scores of CAF dimensions can distinguish speaking performance across CEFR levels.
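The first analytic phase can be illustrated with a short, self-contained Python sketch of Spearman's rho: CEFR levels are mapped to the ordinal codes given above, both variables are converted to average ranks (so that tied CEFR levels share a rank), and a Pearson correlation is computed on the ranks. This is an illustration of the statistic, not the authors' R code. The function names and the feature values in the example are invented.

```python
from math import sqrt
from typing import List, Sequence

CEFR_CODES = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C": 5}


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / (sx * sy)


# Invented example: CEFR levels and a hypothetical fluency feature.
levels = [CEFR_CODES[c] for c in ["A1", "A2", "B1", "B2", "C"]]
feature = [2.0, 2.5, 3.1, 3.0, 4.2]
print(spearman_rho(levels, feature))
```

A rank-based coefficient suits this design because the CEFR codes are ordinal and heavily tied, with 25 speakers sharing each level.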
Results
Correlation among CAF individual features and proficiency level
Descriptive statistics of the individual CAF features of speaking performances across the CEFR levels on the Aptis test are presented in Table 1. Scatter plots of all of the correlations that we ran are presented in Appendices A and B in a supplemental file that readers can find online next to the Language Testing publication of this article. Overall, all fluency variables showed clear increasing or decreasing trends with proficiency level. Results of the correlation analyses confirmed that as the CEFR level increased, speech rate and articulation rate increased (rSR = .70, p < .001; rAR = .55, p < .001) and the number of silent pauses decreased (rSP = −.59, p < .001). In addition, higher-proficiency speakers produced longer speech runs, as indicated by the strong positive correlation (following Cohen’s [1988] guidelines for interpreting the magnitude of correlation coefficients) between mean length of run and CEFR level (rMLR = .60, p < .001). Although speakers at all CEFR levels paused, mainly in the form of silent pauses, when silent pauses were further unpacked, higher-proficiency speakers tended to pause more frequently at syntactic junctures (e.g., clausal boundaries). This is demonstrated in the strong correlation coefficient between the juncture pause rate and CEFR level (rJcP = .84, p < .001). The results also showed a strong positive correlation between proficiency level and the repair success rate at non-junctures (rReSuc = .57, p < .001), which indicates that higher-proficiency speakers tend to be more successful at repairing non-juncture pauses.
Table 1. Descriptive statistics of fluency, complexity, and accuracy features.
Note: * p < .05, ** p < .01, *** p < .001.
As for the correlations among complexity and accuracy features and proficiency level, the number of subordinate clauses per C-unit showed a linear trend that aligned with proficiency level. Spearman correlations showed a strong association between lexical diversity and proficiency level (rMTLD = .79, p < .001), and a strong association between grammatical complexity and proficiency level (rSbCls = .51, p < .001). The correlation coefficient also revealed a strong association between the accuracy measure and proficiency level (rErFCls = .77, p < .001). This result suggests that higher-proficiency speakers produce lexically and grammatically more accurate speech.
Table 2 presents the correlation matrix of fluency features. Again, all correlation scatter plots are in the online supplemental file for readers’ visual inspection. In general, correlation coefficients between holistic and fine-grained fluency features tended to be lower compared to those within each category. However, there were exceptions to this trend. Mean length of run is more strongly associated with number of silent pauses (rMLRSP = −.68, a holistic fluency feature) than with repair success rates (rMLRReSuc = .65, a fine-grained fluency feature). Similarly, number of silent pauses is more strongly associated with mean length of run than with articulation rate (rSPAR = −.60, a holistic fluency feature). Despite these exceptions, the differences in correlation coefficients between these three variable pairs were relatively small (we will further discuss the classification of these two features in the Discussion and Implication sections). Overall, these findings suggest some potential differences between holistic and fine-grained fluency features.
Table 2. Correlation matrix of holistic and fine-grained fluency features.
Note: SR = speech rate, AR = articulation rate, SP = number of silent pauses, MLR = mean length of run, JcP = juncture pause rate, ReSuc = repair success rate.
Factor analysis of CAF features and relationships between factor scores and proficiency level
Upon completion of the correlation analysis, we performed EFA to further examine the relationships among the CAF features and proficiency level. Prior to running the EFA, the performance feature data were screened for the required statistical assumptions. The Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy for the performance features was .849, which suggests that the data are “meritorious” for EFA (Kaiser, 1974, p. 35). Bartlett’s Test of Sphericity was significant (χ2[36] = 569.586, p < .001). The skewness and kurtosis of all features were within the range of −3 to 3 (see the histograms in the online supplemental file), suggesting that these features were approximately normally distributed. Correlation coefficients among all features were below .9 (see Appendix A in the online supplemental file). Taken together, these statistics suggest that the data were adequate for EFA. The scree plot of the EFA for the CAF features (see Figure 1) suggests a two-factor solution (see Appendix C in the online supplemental file for the initial eigenvalues and the percentage of total variance explained by the factors, which were used to determine the appropriate number of factors along with the scree plot).
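For readers unfamiliar with Bartlett's Test of Sphericity, the statistic tests whether a correlation matrix R departs from an identity matrix. With n observations and p variables, it is computed as −(n − 1 − (2p + 5)/6) · ln|R| with p(p − 1)/2 degrees of freedom. The Python sketch below implements this formula with a small pure-Python determinant routine. It is an illustration under these textbook definitions, not the authors' R code, and the example matrix is invented. If nine CAF features entered the analysis, the formula gives 36 degrees of freedom, which is consistent with the value reported above.

```python
from math import log
from typing import List, Tuple

Matrix = List[List[float]]


def determinant(matrix: Matrix) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr: Matrix, n_obs: int) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a correlation matrix."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0.0:
        raise ValueError("correlation matrix must be positive definite")
    chi_square = -(n_obs - 1 - (2 * p + 5) / 6) * log(det)
    df = p * (p - 1) // 2
    return chi_square, df


# Invented 3 x 3 correlation matrix with 125 observations.
example = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]]
print(bartlett_sphericity(example, 125))
```

A large statistic relative to its degrees of freedom indicates that the features are sufficiently intercorrelated for factor analysis to be meaningful.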

Figure 1. Scree plot for the factor analysis of CAF features.
Table 3 shows the factor loadings for the CAF features. Factor 1, loading on both fine-grained fluency features and lexico-grammatical complexity and accuracy, accounted for 46.7% of the common variance among the CAF features. Distinct from Factor 1, Factor 2 loaded only on holistic fluency features and accounted for an additional 18.08% of the common variance. Based on the factor loadings, the two dimensions were interpreted as follows: (1) Factor 1: automatic processing of lexico-grammar, (2) Factor 2: holistic fluency (the interpretation of these two factors is further explicated in the Discussion and Implications sections). Table 4 and Figure 2 present the descriptive statistics and boxplots for the factor scores across the CEFR levels. The correlations between the factor scores and proficiency level were strong for both dimensions (rF1_CEFR = .89, p < .001; rF2_CEFR = .68, p < .001), but the correlation between the two dimensions was moderate (rF1_F2 = .48, p < .001). Scatter plots of these correlations are in the online supplemental file.
Table 3. Factor loadings for CAF features.
Note: Factor loadings < .3 are suppressed.
Table 4. Descriptive statistics of factor scores for CAF features by CEFR level.
Note: * p < .05, ** p < .01, *** p < .001.

Figure 2. Factor scores of CAF dimensions across the CEFR levels.
The subsequent one-way MANOVA of factor scores revealed a significant multivariate main effect for proficiency level (Wilks’ λ = .137, F(8, 240) = 33.688, p < .001, η2 = .63). Post-hoc ANOVA results showed significant differences across the CEFR levels on both the automatic processing of lexico-grammar (F[4, 121] = 115.786, p < .001, η2 = .85) and the holistic fluency dimensions (F[4, 121] = 22.262, p < .001, η2 = .53). Tukey post-hoc comparisons suggest that these two dimensions distinguish speaking performances at different adjacent CEFR levels. Specifically, on the holistic fluency dimension (see Table 5), speakers at and above the B1 level outperformed speakers at the A1 and A2 levels, suggesting that from a holistic lens, the B and C level speakers produce faster and longer speech with fewer pauses. Although an increasing trend was observed on the holistic fluency dimension score across the CEFR levels, there was no significant difference between the A1 and A2 levels or among the B1, B2, and C levels. In contrast, automaticity of lexico-grammar use statistically distinguished four CEFR level groups: A1 < A2 < B1 < B2/C. On this dimension (see Table 5), B2 and C level speakers outperformed the B1 level speakers, followed by the A2 and A1 levels. This suggests that as proficiency level increases, speakers are more automatic at retrieving complex lexico-grammatical resources and maintaining an acceptable level of accuracy or appropriateness in use. However, there was no significant difference between the B2 and C levels.
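The multivariate effect size reported above is consistent with one common definition based on Wilks' lambda, η2 = 1 − Λ^(1/s), where s is the smaller of the number of dependent variables and the hypothesis degrees of freedom. Whether the authors used this formula is an assumption; the sketch below only shows that, with two factor scores and four hypothesis degrees of freedom, it reproduces the reported value.

```python
def multivariate_eta_squared(wilks_lambda, n_dvs, df_hypothesis):
    """Multivariate eta-squared from Wilks' lambda: 1 - lambda ** (1 / s),
    with s = min(number of dependent variables, hypothesis df)."""
    s = min(n_dvs, df_hypothesis)
    return 1.0 - wilks_lambda ** (1.0 / s)

# Two dependent variables (the two factor scores) and df = 4,
# as in the reported follow-up tests F[4, 121].
eta_sq = multivariate_eta_squared(0.137, n_dvs=2, df_hypothesis=4)
print(round(eta_sq, 2))  # 0.63
```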
Table 5. Tukey post-hoc comparisons of CAF factor scores across CEFR levels.
Note: a. Each subset column represents a significantly distinguishable group of CEFR levels; the levels within each subset column are not significantly different from one another. Each cell reports the mean value of the factor scores of each CEFR level.
Discussion
With this study we examined the relationships among an array of CAF features of Aptis speaking performances in an effort to explore the dimensionality of speech fluency. Although scholars theorize language proficiency as a multi-faceted construct comprising complexity, accuracy, and fluency (Housen & Kuiken, 2009), the findings of this study seem to suggest that holistic and fine-grained fluency features might represent different dimensions of speech fluency, and fine-grained fluency features are more closely related to the control of lexico-grammar than holistic fluency features. In what follows, we further explicate the differences between these fluency dimensions and their relationships with complexity, accuracy, as well as overall proficiency.
A moderately strong to strong association has been found between commonly used fluency features (speech rate, articulation rate, mean length of run, and number of silent pauses) and proficiency scores. The magnitude of these correlations aligns with findings of previous fluency investigations in language testing (e.g., Ginther et al., 2010; Iwashita et al., 2008) and in general applied linguistics (e.g., Kang et al., 2010; Révész et al., 2016), which suggests that holistic fluency features, especially speech rate and number of silent pauses, can be used as proxies for overall proficiency.
The strong correlations between fine-grained fluency features (i.e., mean length of run, juncture pause rate, and repair success rate) and proficiency levels are also worth noting. The juncture pause rate was revealed to be a stronger predictor than holistic fluency features such as speech rate and number of silent pauses. Although L2 speakers at all proficiency levels paused frequently, when silent pauses were further unpacked, the pauses produced by higher-proficiency speakers tended to occur more often at syntactic junctures (e.g., clausal boundaries) than did pauses produced by lower-proficiency speakers. This finding is consistent with the majority of previous research on this topic, which found a significant relationship between pause location and proficiency level (e.g., de Jong, 2016; Kahng, 2014; Skehan & Foster, 2008). For instance, Kahng (2014) found that higher-proficiency speakers produced fewer non-juncture pauses (i.e., pauses within a clause) than did low-proficiency speakers. Although the current study examined juncture pauses instead of non-juncture pauses, it confirms that pause location matters and suggests that higher-proficiency speakers are more at ease with parsing complex sentence structures and utilizing juncture pauses to formulate sophisticated content while maintaining the flow or smoothness of speech. In addition, when unexpected pauses occurred, higher-proficiency speakers tended to be better at repairing disfluencies by supplying the appropriate lexico-grammar items to sustain the meaning or topic of utterance. Conversely, lower-proficiency speakers tended to show failed attempts in lexico-grammatical repair and consequently abandon the topic of the original utterance. This suggests that non-juncture pauses and speech repair reflect the speaker’s labored attempts at searching for appropriate lexico-grammar items, indicating the lack of automaticity of lexico-grammar knowledge.
Although correlation analyses confirmed strong relationships found in previous research between both holistic and fine-grained fluency features and proficiency level, this study also contributes new findings regarding the relationship between the two types of fluency features. Unlike previous studies that focused on the relationship between individual fluency features and proficiency levels, this study also examined the relationship among fluency features. In the correlation analysis, we noticed relatively weaker correlations between the holistic fluency features and the fine-grained fluency features. In this study, the correlations between the individual fluency features and proficiency level ranged between .55 and .84 (M = .64). The correlations among the holistic fluency features and among the fine-grained features ranged between .60 and .85 (M = .73) and between .58 and .70 (M = .64), respectively. However, the correlations between fluency features in these two categories only ranged between .28 and .69 (M = .44) (see Table 2). The trend of relatively weaker correlations between these two categories of fluency features provides a legitimate reason to speculate that there might be two different dimensions underlying fluency features.
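The within-category versus between-category comparison described above can be sketched as follows. The feature abbreviations follow the note to the correlation matrix, and the category assignments follow the classification adopted in this study, but every correlation value in the example is an invented placeholder rather than a value from Table 2.

```python
from statistics import mean

# Category assignment as adopted in this study.
category = {"SR": "holistic", "AR": "holistic", "SP": "holistic",
            "MLR": "fine", "JcP": "fine", "ReSuc": "fine"}

# Invented placeholder correlations (NOT the reported values).
r = {
    ("SR", "AR"): 0.80, ("SR", "SP"): -0.70, ("AR", "SP"): -0.60,
    ("MLR", "JcP"): 0.60, ("MLR", "ReSuc"): 0.70, ("JcP", "ReSuc"): 0.80,
    ("SR", "MLR"): 0.50, ("SR", "JcP"): 0.30, ("SR", "ReSuc"): 0.40,
    ("AR", "MLR"): 0.40, ("AR", "JcP"): 0.30, ("AR", "ReSuc"): 0.40,
    ("SP", "MLR"): -0.60, ("SP", "JcP"): -0.40, ("SP", "ReSuc"): -0.30,
}

def summarize(correlations, categories):
    """Group absolute correlations by category pairing and return
    (min, max, mean) for each group."""
    groups = {}
    for (a, b), value in correlations.items():
        same = categories[a] == categories[b]
        key = categories[a] if same else "between"
        groups.setdefault(key, []).append(abs(value))
    return {k: (min(v), max(v), mean(v)) for k, v in groups.items()}

summary = summarize(r, category)
for name, (low, high, avg) in summary.items():
    print(f"{name}: range {low:.2f} to {high:.2f}, M = {avg:.2f}")
```

Absolute values are used so that negatively signed features, such as number of silent pauses, do not cancel out positive correlations when the means are taken.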
Another contribution of this study is the use of factor analysis on individual fluency features, which offers further evidence to suggest that holistic and fine-grained fluency features represent two different fluency dimensions and have different relationships with lexico-grammar use. Although multidimensionality of speech fluency has been noted and conceptualized in previous literature (e.g., Lennon, 1990; Skehan, 2003), few studies have conducted statistical analyses to formally confirm the multidimensionality of speech fluency. In this study, through exploratory factor analyses, we found two different factors underlying the CAF features, namely, automatic processing of lexico-grammar and holistic fluency. In theory, all fluency features can be considered to reflect automaticity in language use. However, what is interesting is that the individual fluency features formed two separate clusters. Whereas the holistic fluency features clustered together, the fine-grained fluency features clustered with complexity and accuracy features. This distinction suggests that fine-grained fluency features (e.g., non-juncture pause and speech repair) constitute a closer reflection of the speakers’ labored search of lexico-grammatical items, and the lack of ability to repair those disfluencies indicates a low level of automatic processing of lexico-grammar.
The results of factor analysis largely correspond to the correlation patterns among individual fluency features. However, two features, mean length of run and number of silent pauses, showed mixed results. According to the correlation matrix in Table 2, mean length of run (a fine-grained fluency feature) showed a slightly stronger correlation with number of silent pauses (a holistic fluency feature) than with repair success rate (a fine-grained fluency feature). Likewise, the number of silent pauses (a holistic fluency feature) showed a slightly stronger association with mean length of run (a fine-grained fluency feature) than with articulation rate (a holistic fluency feature). Ginther et al. (2010) found that the number of silent pauses showed weak correlations with speech rate and articulation rate. However, the fact that fine-grained fluency features were not included in Ginther et al.’s study makes it difficult to directly compare the relative magnitude of correlations among holistic vs. fine-grained fluency features. One possible explanation for the mixed results on mean length of run and number of silent pauses is that these two features are a blend of both holistic and fine-grained characteristics (speed feature conditioned by pauses for mean length of run, and the raw count of pauses in the case of silent pauses), though to different degrees. More research is needed to examine the generalizability of this finding.
That said, the correlations discussed above are all strong, the differences in the correlation coefficients discussed above are small, and the correlation patterns among other fluency features align with the results of factor analysis. Taken together, we maintain the classification of mean length of run as a fine-grained fluency feature and number of silent pauses as a holistic fluency feature. As we argued earlier, this classification is reasonable, as mean length of run is not simply a holistic count of syllables produced in speech; instead, the count of syllables produced is contingent upon locating silent pauses. The computation of this variable makes it carry a “fine-grained” flavor as a measure of speech fluency, indicating the amount of lexico-grammar items that can be produced within a single run. In contrast, the number of silent pauses, though a feature of breakdown fluency, is computed based on the quantity of pauses, rather than the nature (quality, e.g., where, why, and how) of pauses in speech production. Thus, it is reasonable that this feature shares commonality with other holistic fluency features (e.g., speech rate, articulation rate).
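To make the computational contrast concrete, the sketch below derives several fluency features from a response that has already been segmented into runs and silent pauses. The definitions are commonly used ones and are assumptions here, not the operationalizations reported in this study: speech rate as syllables per second of total response time, articulation rate as syllables per second of phonation time, mean length of run as syllables per run, and juncture pause rate as the proportion of silent pauses located at syntactic junctures. The `Run` and `Pause` types and the sample response are hypothetical.

```python
from typing import NamedTuple

class Run(NamedTuple):
    syllables: int
    duration: float  # seconds of phonation

class Pause(NamedTuple):
    duration: float  # seconds of silence
    at_juncture: bool  # True if located at a syntactic juncture

def fluency_features(segments):
    """Compute illustrative macro and micro fluency features from an
    alternating sequence of runs and silent pauses."""
    runs = [s for s in segments if isinstance(s, Run)]
    pauses = [s for s in segments if isinstance(s, Pause)]
    syllables = sum(r.syllables for r in runs)
    phonation = sum(r.duration for r in runs)
    total = phonation + sum(p.duration for p in pauses)
    juncture = sum(p.at_juncture for p in pauses)
    return {
        "speech_rate": syllables / total,
        "articulation_rate": syllables / phonation,
        "silent_pauses": len(pauses),
        # Depends on where pauses fall, not only on how many syllables exist.
        "mean_length_of_run": syllables / len(runs),
        # Undefined (None) when the response contains no silent pauses.
        "juncture_pause_rate": juncture / len(pauses) if pauses else None,
    }

# Hypothetical response: three runs separated by two silent pauses.
response = [Run(8, 2.0), Pause(0.5, True), Run(4, 1.0),
            Pause(0.5, False), Run(6, 2.0)]
features = fluency_features(response)
print(features)
```

In this representation, the count of silent pauses ignores the `at_juncture` flag entirely, whereas the juncture pause rate depends on it, which mirrors the distinction between the quantity and the nature of pauses drawn above.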
Based on the clustering of fluency features, for convenience we label the dimensions underlying the holistic and fine-grained features as macro and micro fluency, respectively. We define macro fluency as a dimension or a set of features that reflect the holistic assessment of speech flow, and micro fluency as a facet that shows where, why, and how disfluencies occur. Macro fluency includes articulation rate, speech rate, and number of silent pauses; these features carry a sense of “product” in speech production. In contrast, micro fluency includes juncture pause rate, repair success rate, and mean length of run; these features are more reflective of the automatic processing of lexico-grammar during speech production and more closely related to linguistic complexity and accuracy. The key to the distinction between macro and micro fluencies lies in whether the features carry fine-grained information about disfluencies or lexico-grammar use. To clarify further the differences between macro and micro fluencies, Table 6 presents a summary of individual fluency features, along with corresponding dimensions in this study as well as in those by Skehan (2003) and Lennon (1990). The majority of the speed fluency features (rapidity in Lennon’s classification) are classified under macro fluency, whereas the majority of the breakdown and repair fluency features (smoothness in Lennon’s classification) are classified under micro fluency. However, the classification of macro and micro fluency is not entirely equivalent to the distinction between rapidity and smoothness. That is, a micro fluency feature can incorporate rapidity as long as the measure of rapidity shows more fine-grained information about automaticity of lexico-grammar use in speech production (e.g., mean length of run). Similarly, a macro fluency feature can indicate smoothness of speech if the feature focuses on presenting a holistic quantity of disfluencies (e.g., number of silent pauses).
Table 6. Classification of fluency features.
The results from multivariate analyses showed that when complexity and accuracy features are included in exploratory factor analysis, macro and micro fluencies have different relationships with overall language proficiency, providing a third layer of evidence for the distinction between these two fluency dimensions. The automatic processing of lexico-grammar dimension, that is, the combination of micro fluency, complexity, and accuracy features, could statistically distinguish four levels, namely A1, A2, B1, and B2/C. In contrast, the holistic fluency dimension, which only includes macro fluency features, captured the differences between the A1 and the B1/B2/C levels, and between the A2 and B2/C levels, but lacked the sensitivity to make finer distinctions between the A1 and A2 levels or among the B1, B2, and C levels. It is important to note that the difference in the ability to distinguish CEFR levels does not suggest that micro fluency is superior to macro fluency, as the automatic processing of lexico-grammar dimension includes features in all three CAF dimensions. However, this result suggests that macro and micro fluency features represent speaker proficiency and variance thereof in different ways. While macro fluency features represent a holistic assessment of speech fluency, micro fluency features seem to possess stronger explanatory power in explicating how speech is produced and perceived, as they are more closely associated with lexico-grammatical features. Taken together, macro and micro fluencies complement each other in representing overall fluency and have differential relationships with complexity and accuracy in representing language proficiency.
It is interesting to note that no meaningful differences were observed between the B2 and C levels in either dimension. Although it is possible that there is no meaningful difference in fluency among higher-proficiency speakers, this inference cannot be drawn without exhausting other plausible explanations. First, the sample size of this study might make a difference. Although different statistical procedures were employed to reduce type I error, the sample size of this study might not produce enough statistical power to detect finer differences between the B2 and C levels. Second, the array of performance features examined in this study, though comprehensive, is not exhaustive. There might be other features that can better characterize the difference between speaking performances at higher levels. One of those features could be precision and register/style of language use (e.g., the use of more precise language to differentiate finer shades of meaning; see Council of Europe (2001) for the descriptors for the C1 and C2 levels on the CEFR). Third, the lack of meaningful differences between the B2 and C levels might be because the Aptis speaking tasks only target up to the B levels on the CEFR, not above. Therefore, the tasks might not be sufficiently difficult to capture the differences between the B2 and C levels.
Implications for fluency-related research in language testing
The co-occurrence between micro fluency features and linguistic complexity and accuracy aligns with theoretical discussion about fluency and automaticity in lexico-grammar use (Lennon, 1990; Segalowitz, 2010; Schmidt, 1992) in that utterance fluency is a result of automaticity in accessing linguistic resources during speech production, and fluent speech is automatic, not requiring much attention or effort. The ability to retrieve grammatical and lexical items effortlessly for an extended period of time can add cognitive validity evidence for L2 speaking assessments.
The close relationship between micro fluency and lexico-grammar is not a new finding in L2 research. Previous research in L2 speech fluency has attempted to relate utterance fluency to cognitive fluency. De Jong et al. (2013) outlined two approaches to linking cognitive fluency to utterance fluency. One approach is to track utterance fluency features in the context of language development. The features that develop alongside proficiency development can be assumed to be closely associated with cognitive fluency. The other approach is to measure both cognitive and utterance fluencies cross-sectionally and identify the utterance features that are related to cognitive fluency measures. In both cases, research seems to suggest that cognitive fluency is more closely associated with micro fluency features such as the mean length of run and juncture pauses (i.e., pause location) (Kahng, 2014; Towell et al., 1996). However, micro fluency features have been given less attention in language testing research than in other subfields of applied linguistics. Including both macro and micro fluency features can help avoid construct underrepresentation when linguistic features are investigated to validate speaking assessments.
Admittedly, macro fluency features are more convenient and automatable, and this is an undeniable advantage over micro fluency features, which often require manual, qualitative analysis. However, micro fluency features offer insights into the nature of language proficiency in ways that macro fluency features cannot. In practice, some commonly used proficiency scales or standards have included micro fluency descriptors to reflect cognitive processing in speech. For example, the Common European Framework of Reference (CEFR) described speaking ability using phrases like “pausing to search,” “false starts and reformulation,” “pausing for grammatical and lexical planning and repair,” and “avoiding or backtracking around any difficulty” (Council of Europe, 2001, pp. 28–29). These descriptors entail micro fluency features that are more directly associated with the qualitative aspects of speech production and automaticity. Therefore, these features should be incorporated in validation research for speaking assessments as well.
In fact, the drawback of using only macro features arises not only in the domain of fluency, but in all CAF domains. The holisticness of commonly used CAF features has been criticized in the investigations of lexico-grammatical complexity (Biber et al., 2016), with more corpus-based research venturing into the use of a wide array of fine-grained lexico-grammatical features using the Biber Tagger (Biber, 1988) or Coh-Metrix (Graesser et al., 2004). As traditional complexity features are measured based on holistic units such as T-units for writing (e.g., Cumming et al., 2006) and C-units for speaking (e.g., Iwashita et al., 2001; Skehan & Foster, 1997), they are typically computed as the ratio between either the number of words or target phrasal/clausal features and the number or length of the T/C-units. The assumption behind these features is that more complex utterances tend to be longer. However, as Norris and Ortega (2009) pointed out, this approach to operationalizing complexity might risk confounding complexity with length, which is often used to operationalize fluency. Biber et al. (2016) explained the pros and cons of both the holistic and fine-grained analyses of lexico-grammar and argued for a middle ground where analysis should include both fine-grained and holistic complexity features and then reduce those features to a smaller number of dimensions for statistical parsimony, power, and interpretability. Perhaps this middle ground can also be applied to the investigation of fluency features in language testing performance data.
Limitations and future research
Although this study has important conceptual and methodological implications for research and practice in language testing and teaching, it has several limitations that can potentially impact the generalizability of the findings. First, although this study was conducted on a larger sample of participants than previous fluency investigations (the majority of studies had a sample size smaller than 50), the sample size is still relatively small for a factor analysis. Although the fluency features were carefully selected and kept at a small number, it would be ideal if a larger sample of data were collected. Future research can cross-validate the findings of this study on a large Aptis spoken corpus or other spoken corpora. With a larger sample size, a confirmatory factor analysis should also be performed to cross-validate this finding from the EFA. Second, this study operationalized syntactic complexity as subordination per C-unit instead of AS-unit (Foster et al., 2000). Future research can verify the generalizability of this study regarding syntactic complexity by using subordination per AS-unit, to check if using AS-unit would alter the correlation patterns and factor loadings. Third, this study only examined a particular type of speaking task, which, albeit representative of a certain real-life context, does not represent the full range of speaking contexts. Previous research suggests that fluency features of speaking performances are also influenced by the complexity of test tasks (see, e.g., Crowther et al., 2018). Thus, a variety of task types could also be included if this study were replicated. Fourth, the lack of information about examinees’ L1 fluency is another limitation of the study. The literature on speech fluency suggests that individual differences or cross-linguistic differences in L1 fluency can partly explain the variability in individual L2 fluency features (e.g., de Jong, 2016, 2018).
Thus, the L1 effect on L2 speech fluency tends to be examined or acknowledged in fluency research. This contrasts with the context of language assessment research, where fluency is often evaluated without knowledge of the test takers’ L1 background, and obtaining such knowledge is arguably impossible in real assessment practices. That said, while we acknowledge that the relationships between fluency factor scores and proficiency level are potentially influenced by L1 effects, not including L1 fluency would not render the analyses in this study invalid. In addition, research suggests that the relationship between fluency and quality of language performance is also mediated through language proficiency level (e.g., Révész et al., 2016). In this study, the exploration of a mediation effect of language proficiency would not be reliable given the relatively small number of participants per CEFR level (n = 25); however, with a larger sample size, it would be an important empirical question to explore in future research.
Another limitation of this study was our underlying assumption that the relationships investigated would be linear. Future research could investigate non-linear patterns and fit additional independent variables, chosen because prior research strongly implicates them, to explain or correct for any observed non-linearity. Potential candidates for such variables, as identified here, are L1 fluency and L2 proficiency.
The limitations notwithstanding, this study provides strong supporting evidence for the differences between macro and micro fluencies. The emergence of different fluency dimensions aligns with earlier discussions that fluency should not be viewed merely as temporal features; instead, emphasizing the link between fluency and automaticity provides a more direct account of the occurrence and recovery of disfluencies. These two dimensions of speech fluency seem to be associated with different aspects of language proficiency. Future research that investigates fluency or speaking performance should include both dimensions in its operationalization. The omission of either dimension will risk underrepresenting the construct of oral language proficiency. With a fuller representation of the construct, we can also establish a better connection between research on speech fluency and disfluency from different disciplines, not only in assessment, but ultimately leading to a mutual conceptualization as well as operationalization of speech production by both L1 and L2 speakers.
Conclusions
To conclude, this study closely investigated temporal speech fluency at both macro and micro levels, in order to examine whether and to what extent different dimensions of speech fluency can be distinguished operationally. The strong correlations found among micro fluency features, lexico-grammar complexity and accuracy, and proficiency level suggest that speech fluency should not only be conceptualized as the amount or rate of speech production. It should also be considered as a reflection of the cognitive processes underlying speech production, connecting temporal features with language processing. In addition, the findings of this study demonstrate that the two approaches to operationalizing speech fluency point to different underlying constructs of speaking ability. Although macro fluency features are reasonable proxies for overall proficiency, micro fluency features add more information directly pertaining to the automaticity of lexico-grammar use. In this connection, we recommend examining both macro and micro utterance fluency features in the validation of speaking assessments. Similarly, the findings of this project stand to have broad implications for research in L2 teaching and learning, with respect to the development of L2 speech fluency and oral proficiency.
Supplemental Material
Supplemental material, LT-19-0102.R4_Supplementary_Materials for Dimensionality of speech fluency: Examining the relationships among complexity, accuracy, and fluency (CAF) features of speaking performances on the Aptis test by Xun Yan, Ha Ram Kim and Ji Young Kim in Language Testing
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the British Council as part of the 2016 Aptis Assessment Research Grants program.
Supplemental material
Supplemental material for this article is available online.
