Abstract
This paper is an investigation into the use of the group oral discussion test (GOT) to detect changes in speaking proficiency over a two-year period. In this test, three or four test-takers discuss a topic for up to 10 minutes without outside intervention. The performance of 53 Japanese university English major students on this test was videoed before their classes started and at the end of their first and second years of study. Indices of complexity, accuracy, and fluency were calculated and interactive function was analysed to create indices that tracked initiating, responding, developing, and collaborating functions. Improvements were detected in most of the indices over the three administrations, with varying patterns of development. However, the test-takers’ scores in five rated scales only improved significantly in the second administration. Possible reasons for this discrepancy are discussed, as are the implications this study has for the GOT format and its administration.
The increasing interest in peer-interaction speaking tests over the last decade is likely due to their advantages within the communicative language learning paradigm. These tests, in which the assessed individuals interact with each other without intervention from the examiner, can be classified into paired and group oral tests (GOT) of three or more, the latter being the focus of this article. GOTs have been noted for their ability to generate positive washback in a communicative curriculum (Bonk & Ockey, 2003), their use with test-takers of mixed ability levels (Bonk & Van Moere, 2004), and their ability to elicit discussions in which members have equal speaking rights in an ESP context (Ockey, 2014). Less positive findings are that shy students may suffer a small but significant disadvantage (Bonk & Van Moere, 2004), the mix of extravert and introvert group members may influence the scoring (Berry, 2004; Ockey, 2009), the use of different prompts can significantly influence the pattern of discourse (Leaper & Riazi, 2014), and GOTs may be difficult to score reliably (Van Moere, 2006).
Although performances on all speaking tests are subject to influence from construct irrelevant factors, the scope and unpredictability of those outlined above may make the GOT format seem more appropriate for in-class rather than higher stakes institutional scale testing. Even so, to date studies have focused on single administrations, and often single factors. The key issue is whether the threats to the validity of the GOT overwhelm the construct of ‘speaking proficiency’ it measures. If this occurs, then every time a student takes a GOT their performance and scores would vary considerably, perhaps obscuring any development they make over time. On the other hand, if a cohort of students can show improvement consistent with what is known about the development of speaking skills, then this provides support for the ability of the GOT format to detect changes in student performance. If a cohort’s scores can reflect this development, it could provide benchmarks for the administration of the GOT in other contexts. These issues are investigated by tracking student performance on the GOT over two years and three administrations on quantitative measures of complexity, accuracy, fluency (CAF), and interactive function 1 as well as the students’ rated scores.
Factors affecting test-taker performance in group oral tests
Factors affecting performance on the GOT can be categorized as being related to the characteristics of the test-takers themselves, the composition of the group, or the test administration. Regarding the first, the impact of gender and shyness on GOTs was investigated by Bonk and Van Moere (2004). Their 1055 Japanese university students completed a seven-item shyness survey immediately after the GOT. The researchers found that gender did not affect scores, but there was a relationship between teachers’ pre-test predictions of their students’ scores and shyness. When the most outgoing students were compared to the shyest, the disadvantage of being shy was calculated as up to 2.5 points out of 20. Other studies investigating shyness had contradictory results. Ockey’s (2011) study of 360 Japanese university students found that self-consciousness, a subset of the NEO-PI-R neuroticism scale, had no effect on scores. This contrary finding to Bonk and Van Moere’s (2004) ‘shyness’ may be due to ‘self-consciousness’ being a narrower construct. Ockey (2011) also investigated assertiveness, one of the six elements of the ‘extraversion’ scale on the NEO-PI-R, and found a small but significant advantage on all scales, with slightly stronger relationships for communication skills and fluency than grammar, vocabulary or pronunciation.
The importance of such personal characteristics may depend on the construct of speaking being assessed. One perspective is that ‘language speaking ability’ should be abstracted from the context (Downey, Farhady, Present-Thomas, Suzuki, & Van Moere, 2008), and thus personal characteristics become threats to validity. Other researchers emphasize that speaking is a co-constructed and context-dependent act of communication (Chun, 2006), which may include the ability to give an opinion during a GOT. Although personal characteristics like ‘shyness’ might inhibit a student’s ability to achieve a higher score, they may also reflect the ability to speak in such contexts.
The second group of factors affecting performance stems from the mix of extravert and introvert test-takers in the group. Ockey’s (2009) study of 228 Japanese university test-takers found that assertive test-takers scored significantly higher when grouped with non-assertive test-takers, but significantly lower when grouped with other assertive ones. It seems raters rewarded assertive test-takers when their group mates were more passive, but being grouped with other assertive group mates presumably made them less conspicuous, and lowered their scores. An earlier study of 447 Hong Kong university students by Berry (2004) had similar results. However, unlike Ockey (2009), Berry also found that introverts’ scores were significantly lower in low extraversion groups, and elevated in high extraversion groups. Possible reasons for these differences are that, in Berry’s study, extraverts enabled introverts to produce more language, or they may have engendered a more convivial atmosphere to which raters responded. However, since the above studies did not investigate the language elicited by the test, it is unclear what the raters were basing their judgments on. Gan’s (2011) study of 39 Hong Kong high-school students used CAF indices produced from the test-takers’ language, but unlike Berry (2004) or Ockey (2009) found no significant relationship between extraversion and either the scores or the CAF indices. It is possible that Gan’s use of self-selected groups from a classroom influenced this finding. Ockey (2009) and Berry (2004) manipulated their participants in order to produce groups that were asymmetric with regard to extraversion, and thus Gan’s groupings may not have varied to the same extent. Nonetheless, the possibility of extreme mixes of personality affecting the scoring must be an ongoing concern when rating GOTs.
The final category is related to the administration of GOTs. A study by Nakatsuhara (2011) found differences owing to the size of the discussion group. The 269 Japanese high-school students in her study took the extraversion scale of the Japanese Eysenck Personality Questionnaire (EPQ), and self-selected into groups of three or four. Nakatsuhara found that in groups of four, extraverts were more influential, whereas groups of three engendered a more collaborative atmosphere in which the members were more likely to support each other. However, Nakatsuhara’s participants were taking part in an experimental event that would have lowered the stakes compared to an authentic assessment. Further research in higher stakes contexts is needed for confirmation.
The prompt is another administrative aspect that affects performance on GOTs. In a study of 141 Japanese university students, Leaper and Riazi (2014) found that the prompt significantly influenced the interactional pattern of the discussion. Prompts that encouraged students to explain a back-story elicited discussions with significantly fewer, longer and more complex turns than those related to a factual here-and-now subject. A prompt on a more personal subject had a negative impact on the test-takers’ fluency as their responses had a significantly higher proportion of pauses. Despite the differences found in patterns of interaction and CAF indices, there was no significant difference in the scores when grouped by prompt, supporting Fulcher and Reiter’s (2003) contention that test scores are remarkably robust in the face of even substantial task variation.
Finally, the possibility of scoring the GOT reliably has been questioned. Van Moere’s (2006) study used a test–retest design with 113 English major students at a Japanese university. The inter-rater reliability coefficient was calculated to be 0.74 and statistical analysis revealed that most of the variation in scores was owing to test-taker performance. Van Moere suggests that such factors may be related to interlocutors or group dynamics, such as those mentioned above.
Collectively the above studies show the potential impact of factors related to the personalities of the test-takers, the composition of the groups and the administration of the test. The question remains as to whether they obscure the ability of this test to detect development in the test-takers’ speaking ability over time.
Development in CAF and interactive function over time
Over recent decades, a large body of literature has attested to the usefulness of the dimensions of complexity, accuracy and fluency (CAF) to index language development (see Housen & Kuiken, 2009). However, research investigating their longitudinal development in speaking is limited (Vercellotti, 2017), and further research has long been called for (Larsen-Freeman, 2009). Among the few studies that had multiple data collection points over a period of a year or more, a substantiated finding is that different elements of CAF develop at different rates. Fluency has been found to improve in the initial periods of studies, but then remain constant in later periods (Serrano, Tragant, & Llanes, 2012). In one study, the participants’ fluency developed by reducing the pauses in their speech before increasing their speed of speaking (Koizumi & Katagiri, 2009). Accuracy, on the other hand, had a more gradual path of development in which improvements may only be seen in later data collection points (Koizumi & Katagiri, 2009; Serrano et al., 2012). For complexity, the findings have been less consistent, with studies finding small improvements in syntactic complexity (Koizumi & Katagiri, 2009) or none at all (Serrano et al., 2012; Mora & Valls-Ferrer, 2012).
The literature on the development of interactive functions is even more limited, as to date only a cross-sectional study can provide a benchmark (Galaczi, 2013). At the lowest level, Common European Framework of Reference (CEFR) B1, test-takers typically suggested new topics that they extended themselves, with little use of backchannels and abrupt transitions to the next topic. The main development for learners at the next level, B2, was the capacity to develop other-initiated topics. Some collaborative features, such as the occasional jointly constructed turn, could also be found, although as listeners they were limited to simple backchannels, such as ‘yeah’. The C1 level was marked by a demonstrably higher level of mutuality, reciprocity, and jointly constructed meaning in conversation. The participants could confidently develop topics and extend them over several turns, and backchannels were used to confirm the other’s opinion in short statements such as ‘yes, I see’ and ‘indeed’. At the next level, C2, the test-takers could do all this in a smoother and more proficient manner. These findings provide useful insights, although how these features develop over time remains an open question.
The limited research on the longitudinal development of CAF and the interactive function provides some indication regarding the pattern of development one may observe among GOT test-takers. The other aspect that this study explores is whether the observed development can be detected in the scores awarded for the test-takers’ performances.
The research questions are as follows:
RQ1: Can development be detected in test-takers’ indices of complexity, accuracy, fluency, and interactive function in the group oral test over two years of instruction?
RQ2: Can development be detected in group oral test-takers’ scores in five rating bands for language performance over two years of instruction?
Methods
Participants
The participants were 53 English major students at a Japanese university who had a relatively uniform background: they were 18 years old when they first took the test, the majority were female (79%), and they had studied English for at least six years at middle and high schools, although some had started earlier. As incoming students, they took the GOT before classes began, and then at the end of their first and second years. In their first two years of study, students attended four 14-week university semesters, in which they had 15 hours per week of English skills classes taught by native English speakers, resulting in a total of 420 hours of classroom time with NS teachers before they took the speaking test for a second time, and 840 hours before the final administration.
The group oral test
The GOT is the speaking section of a four-skills test designed to be closely related to the students’ communicative curriculum, in which peer interaction in pairs or groups is a commonplace activity whatever the particular focus of the course – there is no specific class on ‘how to speak in a group discussion’. For the GOT, they are randomly grouped with students from other classes and with raters who are not their teachers in order to minimize familiarity effects. The scores from the test’s sections make up 20% of the relevant course grade (e.g., 20% of the oral communication class’s score comes from the GOT), and their overall test score is used for placing students in classes at four different levels.
Tasks
The GOTs have three or four participants who discuss a single multi-question prompt (see example in Appendix A) for up to 10 minutes or until the raters have enough information for scoring. In practice, the median length is about 7 minutes. For each administration, three or four prompts are created and translated into Japanese to ensure students’ understanding; one prompt is randomly chosen by the raters before the GOT starts.
Raters and training procedures
Raters are trained using about eight videoed GOTs from the previous administration of the test. Before the training session, the test committee carefully selects videos that provide a range of proficiency levels and together they decide on the most appropriate scores for the test-takers on pronunciation, fluency, vocabulary, grammar and communicative skills (see Appendix in Leaper & Riazi, 2014). In the training session, teachers grade these videos, and compare their scores to those of the test committee. The ensuing discussion deepens their understanding of rating decisions.
Data organization and coding
Over the three administrations, the same rating bands were used, but the test-takers had different interlocutors, raters, and prompts, according to standard procedures. In every administration, some students are assigned to have their tests videoed, and for this study, the same students were recorded in successive administrations. The video transcripts were copied into Microsoft Excel, and timed by a research assistant before being checked by the researcher. The transcripts were analysed using Analysis of Speech Units (AS-units) (Foster, Tonkyn, & Wigglesworth, 2000), and words, clauses, error-free clauses, and other voiced phenomena (maze words and voiced fillers) were counted. 1 Errors in the syntax, morphology, word order or appropriacy of the words in the clause were only counted if they were definitively wrong (Foster & Skehan, 1996). A completely coded test was audited independently by a qualified language education expert with over 20 years’ teaching experience. The agreement between the original and the auditor’s codings was 88.4%. Differences were discussed and the coding system was adjusted to ensure consistency.
For timed phenomena, more precise measurements were obtained using the software Audacity. A turn was timed from the start of the sound a speaker made before his or her first word was uttered to the final audible sound before another participant spoke. Turns were distinguished from backchannel by defining them as utterances that were verbally responded to, or clearly spoken words that were available to be responded to, and this was sufficient in most cases. Doubtful cases were marked and agreed upon by consensus.
Performance indices
The vast majority of research that used CAF indices collected data from one-way information transfer tasks rather than two-way conversational tasks. Since it can be inferred from Nitta and Nakatsuhara (2014) that longer turns are more appropriate for rating oral fluency, only turns of 10 seconds or more were included. This excluded quickly spoken memorized chunks such as “I don’t know” that would have inflated fluency indices. Only including these longer turns limited the data, as four participants in the first administration and two in the third had no turns of over 10 seconds in length, and so were excluded from the fluency data. 2
The measures used are outlined below.
Complexity
Mean length of utterance (MLU): pruned words per AS-unit (Foster & Tavakoli, 2009).
Syntactic complexity: Clauses per AS-unit (Skehan & Foster, 1999).
Accuracy
Global accuracy: the proportion of error-free clauses to total clauses (Foster & Skehan, 1996).
Error-free clauses per opportunity to speak: the number of error-free clauses normalized per participant, per test time.
Speed fluency
Articulation rate: syllables spoken as a proportion of speaking time – excluding unvoiced pauses (Towell, Hawkins, & Bazergui, 1996).
Speech rate: syllables spoken as a proportion of speaking time – including unvoiced pauses (Towell et al., 1996). As this measure includes pauses, it is a composite of speed and breakdown fluency (De Jong et al., 2013).
Breakdown fluency
Pause proportion: time spent in unvoiced pauses of one second or more in speaking time (Iwashita, Brown, McNamara, & O’Hagan, 2008).
Repair fluency
Maze & sound ratio: proportion of repetitions, false starts, self-corrections not necessary for communication (Tavakoli & Foster, 2008) and voiced pauses 3 to unpruned words spoken.
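The complexity, accuracy, and fluency measures listed above can be illustrated with a minimal Python sketch. It assumes that the coded counts for one test-taker are already available; the `SpeakerCounts` container, its field names, and the sample values are hypothetical and are not taken from the study. Rates are expressed per second, which is also an assumption since the unit is not stated, and the same pause time is used for the articulation rate and the pause proportion. The index of error-free clauses per opportunity to speak is left out because its normalization is not fully specified in this section.

```python
from dataclasses import dataclass


@dataclass
class SpeakerCounts:
    """Hypothetical container for one test-taker's coded counts."""
    pruned_words: int        # words left after removing mazes and voiced fillers
    unpruned_words: int      # all words spoken, including mazes and fillers
    maze_and_fillers: int    # repetitions, false starts, self-corrections, voiced pauses
    as_units: int
    clauses: int
    error_free_clauses: int
    syllables: int
    speaking_time_s: float   # timed turns, including unvoiced pauses
    pause_time_s: float      # unvoiced pauses (assumed: those of one second or more)


def caf_indices(c: SpeakerCounts) -> dict:
    """Compute the indices as simple ratios of the coded counts."""
    return {
        # Complexity
        "mlu": c.pruned_words / c.as_units,
        "syntactic_complexity": c.clauses / c.as_units,
        # Accuracy
        "global_accuracy": c.error_free_clauses / c.clauses,
        # Speed fluency (syllables per second, an assumed unit)
        "articulation_rate": c.syllables / (c.speaking_time_s - c.pause_time_s),
        "speech_rate": c.syllables / c.speaking_time_s,
        # Breakdown fluency
        "pause_proportion": c.pause_time_s / c.speaking_time_s,
        # Repair fluency
        "maze_sound_ratio": c.maze_and_fillers / c.unpruned_words,
    }


# Invented sample values, for illustration only.
example = SpeakerCounts(
    pruned_words=90, unpruned_words=100, maze_and_fillers=10,
    as_units=10, clauses=15, error_free_clauses=9,
    syllables=130, speaking_time_s=60.0, pause_time_s=8.0,
)
indices = caf_indices(example)
```

Because the speech rate divides by the full speaking time while the articulation rate removes pause time from the denominator, a reduction in pausing raises the former without changing the latter, which is why the speech rate is described as a composite measure.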
Interactive functions
A list of interactive functions was derived from an analysis of the transcripts, and labelled to be as consistent as possible with previous analyses (Brooks, 2009; Eggins & Slade, 1997; He & Dai, 2006; Van Moere, 2007). In principle, AS-units were used to demarcate the boundaries of the interactive functions, but in a few cases they did not match perfectly. Also, in rare cases (less than 1% of all codings) it was necessary to apply more than one interactive function to an AS-unit.
The analysis resulted in a list of 30 functions which were categorized into Initiating, Responding, Developing, and Collaborating (see Appendix B for all functions), and normalized per opportunity to speak for statistical analysis. 4
Initiating functions: these were produced on the test-taker’s volition to start or add to an interaction. Turn 1 in Excerpt 1 is typical of an opening sequence where a question is used to start the discussion. Also included were such phenomena as follow-up questions to ask for more information and questions used to transfer a turn (see Appendix B, Table A.1).
Responding functions: these were responses to another participant, usually the second half of an adjacency pair in which the first part was an initiating move. In Excerpt 1, the first AS-unit of turn 2 is a Response since it is the minimally sufficient answer to the question in turn 1. For other kinds of Response, see Appendix B, Table A.2.
Developing functions: these were used to expand a participant’s turn beyond the initial response. The second AS-unit in turn 2 of Excerpt 1 (“I have been England and Korea”) is an example since it adds more detail to the Response. Table A.3 in Appendix B has the other Developing functions.
Collaborating functions: AS-units in which enhanced attention is paid to the previous speaker’s utterance by checking understanding, meaning, correcting, or co-constructing. Excerpt 2 provides an example of co-construction where A completes the sentence that C started. Other forms of collaborating are in Table A.4 in Appendix B.
The reliability of the coding was checked by a second rating of the complete transcripts of two GOTs (n = 8) by an experienced colleague with an MA in education. The tests came from the first and last administrations. This sample accounted for eight out of 159 individual performances (just over 5%), which was considered acceptable given the corpus size. After training, the auditor achieved exact matches in 79.86% of the functions. Many of the disagreements involved functions belonging to the same higher category; at the level of these categories, 89.93% of the codings were exact matches, which was acceptably high. The conflicting codings were discussed and agreement was reached over changes to the definitions, which were then applied to the data.
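As an illustration of how exact-match agreement can be computed at both the function level and the higher-category level, the following Python sketch may help. The function labels, the label-to-category mapping, and the two short coding sequences are hypothetical; the actual 30 functions are listed in Appendix B and are not reproduced here. Only the sample share (eight of 159 performances) uses figures from the text.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of positions where two coders assigned the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("codings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical function labels mapped to the four higher categories.
CATEGORY = {
    "open_question": "Initiating",
    "answer": "Responding",
    "acknowledge": "Responding",
    "elaborate": "Developing",
    "explain": "Developing",
    "complete_turn": "Collaborating",
}

# Invented codings of the same four AS-units by the researcher and the auditor.
original = ["open_question", "answer", "elaborate", "answer"]
auditor = ["open_question", "answer", "explain", "acknowledge"]

function_level = percent_agreement(original, auditor)
category_level = percent_agreement(
    [CATEGORY[label] for label in original],
    [CATEGORY[label] for label in auditor],
)

# Share of the corpus that was double-coded, using the figures in the text.
sample_share = 100.0 * 8 / 159
```

In this toy case the two coders disagree on two function labels, but both disagreements fall within the same higher category, so agreement is higher at the category level than at the function level.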
Statistical procedures
For research question 1, the non-parametric Friedman’s ANOVA was used to test for significant differences because neither a normal distribution nor homogeneity of variance could be assumed. This was confirmed by a visual inspection of boxplots and significant results from the D’Agostino–Pearson K2 test of normality (D’Agostino, Belanger, & D’Agostino, 1990). Wilcoxon signed ranks tests were then conducted to identify significant differences between indices from pairs of administrations (Field, 2005). The effect size (r) was calculated from the results of the Wilcoxon signed ranks test by dividing z by the square root of the total number of observations on which the comparison was based (Field, 2005).
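The two calculations described above can be sketched in plain Python. This is a minimal illustration, not the procedure the authors ran, which was presumably carried out in a statistics package. The function names and the toy data are hypothetical. The effect size example uses the z value reported for MLU in the Results and assumes N = 106 observations (53 test-takers in each of the two administrations compared).

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values within one participant, giving tied values their mean rank."""
    ranks = []
    for v in values:
        less = sum(1 for other in values if other < v)
        equal = sum(1 for other in values if other == v)
        ranks.append(less + (equal + 1) / 2)
    return ranks


def friedman_chi_square(rows):
    """Friedman statistic (tie-corrected) for rows of k repeated measures."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        tie_term += sum(t ** 3 - t for t in Counter(row).values())
    statistic = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    return statistic / correction  # compare with chi-square, df = k - 1


def wilcoxon_effect_size(z, n_observations):
    """Effect size r = |z| / sqrt(N), following the description in the text."""
    return abs(z) / math.sqrt(n_observations)


# Toy data: four participants, three administrations (invented values).
toy = [(1, 2, 3), (1, 2, 3), (1, 3, 2), (2, 1, 3)]
toy_statistic = friedman_chi_square(toy)

# Reported z for MLU (first vs. second administration); N = 106 is an assumption.
r_mlu = wilcoxon_effect_size(-2.439, 106)
```

Under the stated assumption about N, `r_mlu` rounds to 0.237, which matches the effect size reported for MLU.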
With multiple comparisons there is a heightened risk of committing Type I errors. This was taken into account by applying the Bonferroni correction, which sets a stricter level for p by dividing the value set for the study (0.05) by the number of comparisons (3), resulting in a critical value of p = 0.017 for this study. Although this conservative correction runs the risk of committing Type II errors (Field, 2005), it was used due to the relatively low number of observations and high number of statistical tests conducted in the study.
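The correction amounts to a single division, sketched below in Python. The first two p values are pairwise results reported in the Results section; the third is a hypothetical value added only to show a comparison that would fail the corrected threshold.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Corrected significance threshold: alpha divided by the number of comparisons."""
    return alpha / n_comparisons


critical_p = bonferroni_alpha(0.05, 3)  # three pairwise comparisons

pairwise_p = {
    "MLU, administrations 1 vs 2": 0.015,              # reported in the Results
    "Articulation rate, administrations 2 vs 3": 0.013,  # reported in the Results
    "hypothetical comparison": 0.030,                   # invented for illustration
}
significant = {label: p < critical_p for label, p in pairwise_p.items()}
```

A result such as p = 0.030 would count as significant at the uncorrected 0.05 level but not at the corrected level, which is the source of the Type II error risk mentioned above.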
For research question 2, a within-groups repeated measures ANOVA (RM ANOVA) was used to compare the scores in the rated scales from the three administrations. Box plots showed that the data varied somewhat from the assumptions of normality and equality of variances. The data was transformed by a log function, but since no improvements were discerned the original figures were used with the knowledge that some statistical power may be lost (Larson-Hall, 2010, p. 340). The Bonferroni post hoc test was used for pairwise comparisons.
Results
Research question 1: detecting changes in the students’ performance indices
Overall, the statistical tests indicated significant differences in all but two of the performance indices with mostly small but some medium effect sizes. The performance indices are displayed graphically in Figures 1–7. Descriptive statistics can be found in Appendix C; the results from the Wilcoxon signed ranks tests are in Table 1.

Figure 1. Complexity: median MLU across three administrations (n = 3).
Figure 2. Syntactic complexity: median ratio of clauses to AS-units across three administrations (n = 3).
Figure 3. Accuracy: normalized number of error-free clauses across three administrations (n = 3).
Figure 4. Accuracy: median proportion of error-free clauses to all clauses across three administrations (n = 3).
Figure 5. Speed fluency indices: median speech and articulation rates across three administrations (n = 3).
Figure 6. Repair and breakdown indices: median proportion of pauses and ratio of maze and sounds to words across three administrations (n = 3).
Figure 7. Interactive indices: the median of normalized Developing, Responding, Initiating, and Collaborating functions across three administrations (n = 3).
Table 1. Wilcoxon signed ranks tests for the significant CAF results.
a = based on positive ranks; b = based on negative ranks.
Note: MLU = mean length of utterance.
Complexity
Figures 1 and 2 show the complexity indices. The MLU graph (Figure 1) shows that a substantial increase was detected in the second administration and a minimal increase in the third administration. Friedman’s ANOVA found a significant difference (χ2(2) = 8.576, p = 0.01), and the Wilcoxon signed ranks test identified it as being due to the difference between the first and the second administration, with a small effect size (z = −2.439, p = 0.015, r = 0.237). In Figure 2, syntactic complexity shows an increase followed by a sharp drop, but no significant differences were found (χ2(2) = 1.380, p = 0.502). This non-significant finding is consistent with other research into language gain (Mora & Valls-Ferrer, 2012; Serrano et al., 2012), so it is not an unexpected result.
Accuracy
The most noticeable feature of the accuracy indices in Figures 3 and 4 is the considerable increase detected in the third administration. Friedman’s ANOVA indicated significant differences in both indices (Error-free clauses: χ2(2) = 26.190, p < 0.001; Error-free proportion: χ2(2) = 22.340, p < 0.001). The Wilcoxon signed ranks test found that the number of error-free clauses per opportunity to speak (Figure 3) was significantly different in both the second (z = −3.729, p < 0.001, r = 0.362) and third administrations (z = −4.165, p < 0.001, r = 0.405) with small and moderate effect sizes respectively. In Figure 4, there was no significant difference in the proportion of error-free clauses in the first period, but there was between the second and third administrations with a medium effect size (z = −5.613, p < 0.001, r = 0.545). This pattern is explained by the test-takers’ using more error-free clauses in the second test, although proportionately making as many mistakes as in the first. In the final test their accuracy improved proportionally along with the number of clauses, consistent with the development described by Koizumi and Katagiri (2009).
Fluency
For the speed fluency indices in Figure 5, an upwards trend represents increases in fluency, whereas for the repair and breakdown fluency indices, in Figure 6, a downwards movement indicates improvement as students reduce the disfluencies in their speech. Friedman’s ANOVA detected significant differences in all four of these indices (Articulation Rate: χ2(2) = 6.125, p = 0.047; Speech Rate: χ2(2) = 41.292, p < 0.001; Pause Proportion: χ2(2) = 38.292, p < 0.001; Maze & Sound Ratio: χ2(2) = 26.167, p < 0.001). The Wilcoxon signed ranks test showed varied patterns of development. The indices that were significantly different only between the first and second administrations were Speech Rate (z = −4.974, p < 0.001, r = 0.492) in Figure 5, and Pause Proportion (z = −5.397, p < 0.001, r = 0.534) in Figure 6, both with moderate effect sizes. This similarity is probably owing to the Speech Rate’s inclusion of pauses within speech. The Articulation Rate (in Figure 5), a pure measure of speed fluency, had a pattern of delayed development in which there was a significant difference only between the second and third administrations, with a small effect size (z = −2.489, p = 0.013, r = 0.244). Likewise, repair fluency, as measured by the Maze & Sound Ratio (Figure 6), was also only significantly different between the second and third administrations (z = −2.911, p = 0.004, r = 0.285).
Interactive functions
In Figure 7, all interactive indices show an increasing trend. Friedman’s ANOVA found that only Collaborating functions (χ2(2) = 0.146, p = 0.93) were not significant, probably due to their rarity in the data, but the remaining indices had significant differences (Initiating: χ2(2) = 10.106, p = 0.006; Responding: χ2(2) = 6.577, p = 0.037; Developing: χ2(2) = 21.981, p < 0.001).
The results of the Wilcoxon signed ranks tests (Table 2) indicate significant differences in the Developing (z = −3.801, p < 0.001, r = 0.369) and Responding functions (z = −2.532, p = 0.011, r = 0.246) in the second administration with medium and small effect sizes respectively. Differences in Initiating functions were only significant between the first and final administrations (z = −2.807, p = 0.005, r = 0.273), but the larger effect size (r = 0.225 in the third administration, as opposed to r = 0.164 in the second) indicates that this was largely owing to the difference between the second and third administrations.
Table 2. Wilcoxon signed ranks tests for the significant interactive function results.
Note: All figures based on negative ranks.
The Interactive indices detected a pattern of development in which test-takers improved initially by making more responses to the question being discussed and taking longer turns. This focus on their own speaking may have decreased the opportunity to use Initiating functions in the second test. The increased usage in the Initiating functions in the final administration suggests that test-takers are moving from a focus on their own speech to paying greater attention to the interaction of the discussion, consistent with Galaczi (2013).
Research question 2: detecting development in the students’ scores
The statistical tests found significant differences in all the scales between the first and second administrations, but only the vocabulary scale differed significantly between the second and third administrations. Figure 8 displays the test-takers’ average scores, and the descriptive statistics can be found in Appendix C, Table A.7.

Figure 8. Average scores in rated scales across three administrations (n = 3).
The RM ANOVA and post hoc Bonferroni comparisons (Table 3) showed that scores on all scales in the second administration were significantly higher than those in the first, but in the third administration the only significant result was the change in the vocabulary score, which was lower than in the second administration. To answer the research question, development seemed to be detected in all the test-takers’ scales after one year of study, but in the second year, this appears to have stalled, and in the case of vocabulary, retreated.
Table 3. Figures returned by the RM ANOVA on GOT scores.
Note: Comm. skills = communicative skills.
Discussion
The first objective of this paper was to investigate the changes detected in the performance indices of GOT test-takers on three tests conducted over two years. Complexity, as measured by the MLU, improved only in the second administration, showing that test-takers spoke in longer units in their second GOT than their first, but did not improve further in their third test. Of the two accuracy indices, the number of error-free clauses also improved only in the second administration, but the proportion of error-free clauses only improved in the third administration. The GOT could detect a two-stage pattern of development: in the first stage, a quantitative improvement was detected in the test-takers’ speaking, which included a greater number of error-free clauses. In the second stage, the increase in the proportion of error-free clauses in their final GOT indicates a qualitative improvement in their speaking. This delayed onset of improvement in accuracy is consistent with previous studies (Koizumi & Katagiri, 2009; Serrano et al., 2012). The congruency of these results suggests that learners develop the ability to speak at greater length before their accuracy begins to improve.
For fluency, test-takers initially improved their breakdown fluency in the second administration by reducing the proportion of pauses in their speech, whereas in their third test they improved speed and repair fluency, as seen in their gains in rate of articulation and their reduction of maze words and sounds. Although it is a complex phenomenon, at lower levels of development pausing is often symptomatic of the cognitive complexity of speech production and so is typically associated with a speaker’s need to search for appropriate lexical resources or to engage in “unexpected online planning” (Foster & Skehan, 1999, p. 229). The improvements in their pausing may be indicative of a development in automaticity that frees up learners’ mental resources, allowing them to increase their rate of articulation and reduce their use of maze words.
The Speech Rate, which combines elements of speed and breakdown fluency, improved in both second and third administrations. The larger effect size in the second administration was driven by the reduction of pauses in the data. By their third test these test-takers were talking more quickly with fewer maze words and filled pauses, but the amount of pausing between chunks of language is comparable to that in their second test. These results are consistent with Serrano et al. (2012) and Koizumi and Katagiri (2009), where the improvements with the largest effect sizes were observed in the third administration in measures which included pausing phenomena. Unfortunately, those studies did not include repair fluency measures or a purer measure of speed fluency (such as the Articulation Rate), making further research necessary for confirmation.
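A minimal sketch may help to show why the Speech Rate can improve at both stages for different reasons. The definitions below are common operationalizations assumed for illustration (syllables per minute of total time, syllables per minute of phonation time, and pause time as a share of total time); they may differ from the exact formulas used in this study, repair fluency is omitted, and the timing figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpeechSample:
    """Hypothetical timing data for one test-taker in one test."""
    syllables: int
    total_seconds: float   # speaking time including pauses
    pause_seconds: float   # time spent in silent pauses

    def speech_rate(self) -> float:
        """Syllables per minute of total time (speed plus breakdown fluency)."""
        return self.syllables * 60 / self.total_seconds

    def articulation_rate(self) -> float:
        """Syllables per minute of phonation time (speed fluency only)."""
        return self.syllables * 60 / (self.total_seconds - self.pause_seconds)

    def pause_proportion(self) -> float:
        """Share of total time spent pausing (breakdown fluency)."""
        return self.pause_seconds / self.total_seconds


# Hypothetical values, not taken from the study.
samples = {
    1: SpeechSample(syllables=120, total_seconds=90, pause_seconds=42),
    2: SpeechSample(syllables=150, total_seconds=90, pause_seconds=30),
    3: SpeechSample(syllables=165, total_seconds=90, pause_seconds=30),
}

for admin, s in samples.items():
    print(admin, round(s.speech_rate(), 1), round(s.articulation_rate(), 1),
          round(s.pause_proportion(), 2))
```

With these assumed figures, the Speech Rate rises from 80 to 100 at the second administration purely through reduced pausing (the Articulation Rate stays at 150), and from 100 to 110 at the third purely through faster articulation (the pause proportion stays at one third).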
The pattern most clearly seen among the Interactive functions was the improvement in Responding and Developing functions that occurred only between the first and second administrations. These test-takers first developed their ability by contributing more frequently to the discussions and expanding on their responses. This is consistent with Galaczi’s (2013) finding that a feature of low-level learners (at B1) is the focus on extending their own contributions, and this is borne out by the largest effect size in the Developing functions in the second administration, as these functions are used by speakers to extend their talk. In Galaczi’s study, students developed by improving their listener skills and ability to jointly develop topics, and this can be seen in the gradual improvement in Initiating functions between the first and third tests. The larger effect size in the second year suggests that initiating skills develop more slowly than the ability to speak at greater length, supporting Galaczi’s (2013) cross-sectional study.
The inability of the test to detect development in the Collaborating functions may be explained by their rarity: even by the time of the third administration no individual function had an average close to a single usage per participant. This is consistent with previous research that examined similar functions (He & Dai, 2006; Van Moere, 2007). Several reasons may be put forward to account for this. First of all, as is suggested by Galaczi (2013), collaborative features are more prominent at higher levels (C1 and above), and it is possible that even by the third administration few learners in this study had reached this level. Furthermore, given that collaborative functions are dependent on other participants for their use, it may be that the format of the test fails to give opportunities for the use of collaborative functions, as other studies on the GOT have suggested (Van Moere, 2007). Another possibility is that test-takers may find using such functions inherently risky or face-threatening for themselves, as well as fellow participants, and feelings of solidarity may encourage them to accept rather than question their fellow test-takers’ words (see Luk, 2010). Finally, with Japanese L1 participants, cultural norms about the appropriacy of these functions may play a role, since features such as overlaps and interventions are not as common in their L1 conversation (Furo, 2013).
The other index in which no significant difference was found was in syntactic complexity. One explanation of this may be Norris and Ortega’s (2009) contention that this measure is more appropriate for higher level learners. If the participants were not of a high enough level, then this measure would not be appropriate for them. While there may be some truth in this, a more likely explanation lies in the nature of the data and the influence of the prompt. Using data collected during the second administration (including many of the same participants in this study), Leaper and Riazi (2014) found that two of the four prompts elicited discussions in which turns were significantly longer and more complex. The spike in the second administration seen in Figure 2 may be explained by 58.5% of the test-takers in the current study responding to these prompts. This finding shows the necessity of tailoring prompts to ensure they are more likely to elicit responses that align to the purpose of the test, as concluded by Leaper and Riazi (2014).
The finding that the GOT can detect speaking development needs to be incorporated into a body of research that has exposed various factors that increase the variability of test-taker performances. Since validity derives from the inferences drawn from the scores (Messick, 1995), the interpretation of GOT scores must take into consideration this array of negative influences. For GOTs, it seems that inferences can be made not only about language proficiency but also about the individual’s ability to actively participate in the co-construction of a discussion in a stressful situation. Certain stakeholders might find this aspect of the GOT format to be informative of the ability of the test-taker.
The second research question investigated changes in the students’ scores over two years of instruction. Although the finding that all rated scales improved from the first to the second test is consistent with the performance indices, the same cannot be said for the period between the second and third tests. Despite significant improvements being detected in the performance indices, no corresponding improvement in their scores was found. The most likely explanation for this discrepancy is the inconsistency in the scope of language features described by the rating bands and the indices: while the indices do not vary over the duration of the study, this may not be the case with the features of spoken discourse described in the rating bands (see Appendix of Leaper & Riazi, 2014). For example, the 2.0 band descriptions may cover too wide a range of ability (or be interpreted too liberally by the raters), and once a test-taker crosses this threshold the rater finds it difficult to award a higher score. Moreover, the rating bands are subject to interpretation by human judges whose ability to misinterpret, misuse or ignore score descriptors (Douglas, 1994; May, 2009; Orr, 2002) should not be underestimated. The inconsistent ability of the scoring to detect development requires local practices such as rater training and administrative policies to be scrutinized and reformed.
A path towards developing a scoring rubric to detect improvements in performance would be to revise the rating bands. Ideally the scales would reflect the progress students typically make over the duration of the course. The first band should describe the performance of incoming students, and the second and third bands should describe their development after one and two years respectively. Developmental progress as shown by the CAF indices suggests that in the second band for the grammar scale, complexity should be interpreted as ‘length of clause’, as only this was found to improve significantly in the second test, and accuracy should be emphasized in the third band. The second band of the fluency scale should foreground improvements in breakdown fluency, and the third band should foreground improvements in speed fluency. In scales that cover communicative skills, responding and developing functions are key improvements to include in the second band, while initiating functions should be the focus of the third band. Changing the rating bands so they reflect development would potentially make scoring easier and improve the washback for students as they see their progress reflected by their test scores.
Along with these findings, the limitations of this study need to be considered. First of all, the low number of observations and the use of many statistical tests inevitably increase the possibility of error. Additionally, covariance among the indices may affect the number of significant findings. Although steps were taken to mitigate these possibilities, future research with a larger dataset would be desirable. Recent advances in statistical methods would allow more accurate modelling of the complex, dynamic and nonlinear development of language at both the individual and the cohort level (Murakami, 2016).
Further, this study was limited to a quantitative approach; future studies should examine the phenomena identified herein using mixed methods that would match findings from conversation analysis to the rating bands. Collecting samples of the test-takers’ language use outside of the test would make it possible to investigate whether the findings from the performance indices could be corroborated.
Another limitation is that most of the indices rely on counting the instances of a given phenomenon, which does not take into account relevance or quality. This limitation particularly affects the analysis of interactive function, where a bare count of the features neglects the effectiveness of the utterance in a specific context. For example, the Developing functions measure quantity, but the index merely includes the number of AS-units rather than what was actually talked about. Nonetheless, previous research has generally found that talking more is positively correlated with scores, suggesting that raters tend to value the contributions of those who talk more (Kobayashi & Van Moere, 2003). The use of qualitative methods such as conversation analysis would enable further insights into this issue.
Conclusion
Despite such limitations, this study has filled a gap in the literature on speaking tests by providing a quantitative analysis of test-takers’ progress over a number of administrations in terms of indices of speaking performance and the scores awarded by raters. In doing so, it sheds light not only on the GOT as an assessment of students’ speaking performance, but also on the challenge faced by administrators in developing speaking assessments that accurately record the development of speaking proficiency at their institutions.
Appendix A: Example of prompt used
In the traditional Japanese family, men earned the money and women did the housework. Is your family traditional or not? Why do you think so? What are the advantages and disadvantages of the traditional family? Do you think the situation in Japan is changing? Why?
Appendix B: Interactive functions
Summary of Collaborating functions.
| Collaborating function | Brief explanation |
|---|---|
| Question-clarify | Questions that clarify something about what the speaker just said. |
| Question-confirmation | Questions that confirm meaning. They are distinguished by their preferred response being ‘yes’. |
| Correction | When a speaker suggests an alternative to what another said in the belief that they had made a mistake, it was classified as a correction. This category includes recasts. |
| Completing Sentences | When another participant steps in and provides the next words that the speaker may have wanted to say. This was double coded when a speaker used a raised intonation to ‘Clarify’ the meaning. |
| Suggest Words | Even when there was no obvious opportunity such as a trailing sentence, another participant sometimes offered words that the speaker might have used. |
| Incomprehension | When a speaker admits to not knowing what was meant, or not knowing what to say, it invites others to collaboratively contribute by supplying an answer or clarifying meaning. |
| Respond to help | If a mistake was pointed out by another, then the recipient of this help could make the correction or repair, and if they did so it was coded as Respond to help. |
Appendix C: Descriptive statistics
Descriptive statistics for the scoring of the test.
| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| | Pronun. | Fluency | Grammar | Vocabulary | Comm. Skills | Total |
| Administration 1 | | | | | | |
| Median | 2.4 | 2.2 | 2.2 | 2.2 | 2.4 | 11.6 |
| Mean | 2.321 | 2.309 | 2.251 | 2.179 | 2.389 | 11.449 |
| Std Dev | 0.691 | 0.543 | 0.530 | 0.564 | 0.660 | 2.843 |
| Administration 2 | | | | | | |
| Median | 2.8 | 2.7 | 2.6 | 2.6 | 2.9 | 13.8 |
| Mean | 2.798 | 2.753 | 2.653 | 2.699 | 2.901 | 13.806 |
| Std Dev | 0.500 | 0.465 | 0.443 | 0.415 | 0.495 | 2.159 |
| Administration 3 | | | | | | |
| Median | 2.8 | 2.6 | 2.5 | 2.6 | 2.9 | 13.3 |
| Mean | 2.817 | 2.699 | 2.618 | 2.583 | 2.835 | 13.551 |
| Std Dev | 0.376 | 0.504 | 0.381 | 0.359 | 0.478 | 1.834 |
N = 53.
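The changes in mean scores between administrations can be read directly from the table above. The following sketch simply subtracts the reported means; the variable and function names are ours, and the raw differences are descriptive only and say nothing about statistical significance.

```python
scales = ["Pronun.", "Fluency", "Grammar", "Vocabulary", "Comm. Skills", "Total"]

# Mean scores as reported in the descriptive statistics table (N = 53).
means = {
    1: [2.321, 2.309, 2.251, 2.179, 2.389, 11.449],
    2: [2.798, 2.753, 2.653, 2.699, 2.901, 13.806],
    3: [2.817, 2.699, 2.618, 2.583, 2.835, 13.551],
}


def mean_change(earlier, later):
    """Difference in mean score per scale between two administrations."""
    return {
        scale: round(after - before, 3)
        for scale, before, after in zip(scales, means[earlier], means[later])
    }


first_year = mean_change(1, 2)
second_year = mean_change(2, 3)

for scale in scales:
    print(f"{scale:<13} year 1: {first_year[scale]:+.3f}   year 2: {second_year[scale]:+.3f}")
```

Every scale mean rises by roughly 0.4 to 0.5 of a band in the first year, whereas in the second year the changes are small and mostly negative, with vocabulary showing the largest drop among the five scales (from 2.699 to 2.583).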
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Hankuk University of Foreign Studies Research Fund of 2018.
