Abstract
Nonverbal behavior can impact language proficiency scores in speaking tests, but there is little empirical information on the size or consistency of its effects or whether language proficiency may be a moderating variable. In this study, 100 novice raters watched and scored 30 recordings of test takers taking an international, high stakes proficiency test. The speech samples were each 2 minutes long and ranged in proficiency levels. The raters scored each sample on fluency, vocabulary, grammar, and comprehensibility using 7-point semantic differential scales. Nonverbal behavior was extracted using iMotions, an automated machine learning application, and data was analyzed with ordinal mixed effects regression. Results showed that attentional variance predicted fluency, vocabulary, and grammar scores, but only when accounting for proficiency. Higher standard deviations of attention corresponded with lower scores for the lower-proficiency group, but not the mid/higher-proficiency group. Comprehensibility scores were only predicted by mean valence when proficiency was an interaction term. Higher mean valence, or positive emotional behavior, corresponded with higher scores in the lower-proficiency group, but not the mid/higher-proficiency group. Effect sizes for these predictors were quite small, with small amounts of variance explained. These results have implications for construct representation and test fairness.
Performance assessment scores can be impacted by a wide range of phenomena. Raters might focus on particular rating criteria more than others, for example, or their scores may show bias due to the background characteristics of test takers. Raters may also focus on contextual elements not present in rating scales (Lumley, 2002). Because interpreting the visual world is integral to most instances of spoken communication (Hall et al., 2019), it is possible that raters use nonverbal behavior as a source of implicit or even explicit information when awarding second language (L2) proficiency scores. Despite notable research providing qualitative evidence of this relationship (Jenkins & Parra, 2003; Neu, 1990; Sato & McNamara, 2019), estimations of the effect size of nonverbal behavior on language ability ratings are scarce. In this article, I describe a study that addressed this gap by measuring variance attributable to nonverbal behavior in the context of a direct, remotely delivered speaking test.
Background
Nonverbal behavior can convey a wide range of semantic, cognitive, affective, and interactional information (Hall et al., 2019). It aligns with language to strengthen and emphasize meanings, occurring spontaneously, automatically, and without attention (Buck & VanLear, 2002). Verbal and nonverbal channels of communication are intertwined on multiple levels: Producing talk involves visible breathing and articulating movements not only of the face and the mouth, but of the entire body; moreover, these articulatory movements are dissociable from other bodily conduct . . . both talk and gesture originate from the same process. (Mondada, 2016, p. 340)
Mondada (2016) explained that attempting to isolate language as an object of study or source of truth may lead to an overly restrictive logocentric focus on communication, neglecting critical semiotic information conveyed through nonverbal communication. Language—and, by extension, second language proficiency—cannot be fully understood devoid of context.
Nonverbal behavior may show acquisitional patterns at different proficiency levels, especially in the case of gesture (e.g., Gregersen et al., 2009). This fact led Stam (2008) to remark that “looking at learners’ gestures and speech can give us a clearer picture of their proficiency in their L2 than looking at speech alone” (p. 253). In one notable example of this interconnection, Gan and Davison (2011) investigated behaviors in a high- and low-proficiency group that took a group-based interactive speaking test in Hong Kong. Using multimodal conversation analysis, the authors found that the higher scoring group used co-speech gestures (i.e., gestures that coincided with speech rather than silence) to provide detailed lexical meaning, emphasize particular ideas and suggestions, and aid in interactional management. The test takers also integrated eye contact, facial expressions, and body posture into their interactional moves. The lower scoring group displayed markedly different nonverbal behaviors in their group interaction, with some interlocutors being fairly rigid with very few visible nonverbal behaviors, while others used gestures that did not align with speech and were self-adapting in nature (e.g., scratching hair).
Behaviors may also shift as a result of L2 input, which may reveal information related to language proficiency. For example, as L2 test questions increase in difficulty, test takers may spend more time averting their gaze from the examiner (Burton, 2023a). Research on L2 social interaction has described how L2 test takers visually display intersubjectivity (Burch & Kley, 2020), comprehension breakdowns (Seo & Koshik, 2010), and repair initiations (Burton, 2021) following input in their L2 as well. Raters notice these behaviors in testing contexts and remark on their close relationship with communicative competence (Sato & McNamara, 2019) and interactional competence (Ducasse & Brown, 2009; May, 2011).
Nonverbal behavior and test scores
Simply having access to visual nonverbal cues can enhance speech comprehension among interlocutors (Dahl & Ludvigsen, 2014; Drijvers & Özyürek, 2017). This facilitatory effect in speech comprehension may also explain higher scores in tests where the test taker is visible over audio-only recordings (Carey & Szocs, 2024; Choi, 2022; Nakatsuhara et al., 2021). Choi (2022), for example, investigated the score differences of 110 test takers on two asynchronous audiovisual recordings conducted on Zoom (with and without the interlocutor present) and audio-only recordings. She found that audio-only scores were lower than both video-recorded formats, and the two video formats were approximately equivalent. In a Rasch model, audio-only ratings were approximately 0.5 logits more difficult than the video recording with an interlocutor present, and 0.75 logits more difficult than the video recordings with the interlocutor removed. Similarly, Nakatsuhara et al. (2021) also found that audio-only ratings resulted in half-band lower scores on a 9-point IELTS rating scale. They speculated that the visibility of nonverbal behavior may explain score differences, as it helps raters “a) to understand what the test takers were saying, b) to comprehend better what test takers were communicating using non-verbal means . . ., and c) to understand with greater confidence the source of test takers hesitation, pauses, and awkwardness” (Nakatsuhara et al., 2021, p. 19).
Several patterns have emerged from studies that considered the impact of specific behaviors on raters’ perceptions of L2 speech. One such pattern is that attention, as expressed through eye gaze behavior, is salient to raters and may impact perceived language proficiency. Mutual eye gaze (i.e., when the test taker looks directly at the examiner or interlocutor) has been documented as having a positive impact on test scores, as it may indicate engagement, a desire to communicate, or greater listening comprehension; on the other hand, averted eye gaze is often perceived negatively, as it may indicate anxiety or breakdowns in fluency (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2009, 2011; Nakatsuhara et al., 2021; Sato & McNamara, 2019). Tsunemoto et al. (2022), in contrast, documented the potentially positive impact of gaze shifting between mutual and averted states, which the authors hypothesized may have enhanced the narrative quality of test takers’ utterances, thus alleviating some of the raters’ cognitive load while processing speech.
In a second pattern, overall expressiveness through the use of facial, head, and bodily behavior appears to have a beneficial effect on perceived language ability. For example, eyebrow raising and frowning may play a role in interaction management as well as in conveying prosodic features of speech (Kim et al., 2024; Neu, 1990; Tsunemoto et al., 2022). Head nods and tilts may convey engagement, listening comprehension, and aid in the management of interaction, leading to greater comprehensibility (Jenkins & Parra, 2003; May, 2009, 2011; Neu, 1990; Trofimovich et al., 2021). Forward leaning posture can convey engagement and comprehension, while leaning backward may convey confidence and low anxiety (Jenkins & Parra, 2003; Neu, 1990). On the other hand, the absence of these behaviors through overall inexpressiveness and rigidity may also have a correspondingly negative effect on proficiency outcomes (Gan & Davison, 2011; Jenkins & Parra, 2003; May, 2009, 2011; Neu, 1990).
Expressiveness, though, may not affect all test takers uniformly. Jenkins and Parra (2003) highlighted the importance of overall expressiveness in a multimodal discourse analysis of eight test takers (international teaching assistants) in a paired-format L2 oral proficiency test. The test takers were rated on pronunciation, grammar, fluency, and comprehensibility, receiving composite scores ranging from 1 (low) to 4 (high), where 3 was defined as the threshold for passing. The eight scores ranged from 2.3 to 3.9. They found that some test takers with borderline-passing linguistic skills were able to compensate for their weaknesses by taking an engaged, actively communicative stance, which boosted their scores in comparison to test takers with rigid, inexpressive behavior. Behavior, then, served to “convince the evaluators that they have a higher level of proficiency than may in fact be the case from a purely linguistic perspective” (p. 100). The authors noted the importance of nonverbal behavior for less proficient test takers, as their nonverbal repertoires can help them compensate for linguistic difficulties and communicate more effectively. They noted that more proficient test takers are likely to benefit less from nonverbal behavior, as their linguistic skills are already sufficient to handle the demands of the speaking test. Nonetheless, it is difficult to extrapolate from this sample, as the test was designed for teaching assistants at a relatively high level of proficiency rather than L2 speakers with a broader range of language ability.
Finally, the affective quality of behaviors to convey positive emotions may also play a role in proficiency outcomes, though research is limited. Smiling, a behavior often associated with positivity, may lead to enhanced perceived fluency, greater engagement, lower perceptions of anxiety, and possibly greater comprehensibility (Jenkins & Parra, 2003; Kim et al., 2024; Neu, 1990; Thompson, 2016; Trofimovich et al., 2021). Nevertheless, smiling alone may be insufficient, as this behavior may convey a range of affective states. For example, Duchenne smiles, which combine mouth, eye, and eyebrow movement, have been documented as having a positive impact while non-Duchenne smiles, which only include movements of the mouth, may not (Thompson, 2016).
Research questions
The literature reviewed shows that nonverbal behavior is a key element of communication that interlocutors and raters notice when making judgments about L2 speakers’ language ability, but questions remain about whether these behaviors actually exert an influence on test scores. Although findings have shown that attention and expressiveness may improve impressions of speakers’ fluency and comprehensibility, the majority of this research has been through relatively small-scale qualitative interviews or stimulated recalls. The relationship between emotional valence (positive/negative behaviors) and rated outcomes is uncertain. There are few empirical studies demonstrating these relationships, and there are no empirical studies investigating judgments of vocabulary or grammar ability in relation to nonverbal behavior. The moderating variable of proficiency also deserves attention, as Jenkins and Parra (2003) speculated that less proficient test takers may benefit from the use of nonverbal behavior more than highly proficient test takers, especially those at borderline passing levels. No study to date, however, has investigated this effect. Thus, this study aims to fill these gaps by using a larger pool of rater participants and speech samples from a wider range of ability levels to analyze relationships between nonverbal behavior and language ability ratings and their interactions with proficiency. Such research is needed in the second language acquisition and language testing literature to expand a theoretical understanding of core aspects of language ability and to detail sources of score variance. These findings could then be used for interventions such as rater training and scale revision if large sources of variance are present.
Based on these findings, I formulated the following two research questions:
Does nonverbal behavior predict rated outcomes of language ability?
Do nonverbal behaviors impact rated outcomes differentially depending on the test takers’ base proficiency (prior standardized speaking test scores)?
Based on the literature reviewed, I hypothesized that (1) attention and engagement, but not valence, will result in significant positive regression coefficients of fixed effects in models; and (2) significant interaction coefficients of proficiency + attention and proficiency + engagement will indicate a moderating impact of nonverbal behavior by language proficiency level. Low-ability and high-ability speakers will be less impacted by behavioral measures than mid-range, borderline ability levels.
Method
This study was the central chapter of my dissertation investigating the impact of nonverbal behavior and affect on language test scores. The mixed-methods dissertation used multiple sources of data to uncover relationships between nonverbal behavior, perceptions of affect, and language ability ratings (Burton, 2023c). This study presents the analysis of automated measurements of nonverbal behavior as they related to these language ability ratings. Other data sources, including qualitative data from follow-up surveys and stimulated recall procedures, are not analyzed in this paper due to its reduced scope. During the planning phase of the dissertation, once I had decided on all methodological details, I preregistered the research questions, hypotheses, and method in the Open Science Framework (OSF). The preregistration and other open materials and data are publicly accessible in an OSF repository (Burton, 2023b).
Participants
I recruited 100 raters to take part in this study. All raters were first language (L1) English-speaking, American-born undergraduate students enrolled in a large public university in the United States. Place of birth and L1 were controlled to reduce variation in the interpretation or decoding of nonverbal behavior, as past research has shown notable differences in how nonverbal behavior is perceived in different cultural contexts (Matsumoto & Hwang, 2012). The mean age of raters was 20.92 years (SD = 1.48), and gender was balanced (52% female, 41% male, 6% other). Thirty-eight percent of the participants reported speaking an L2, and their distribution by school year was roughly even. Participants reported a wide range of majors across the university.
A key consideration in this study was to recruit participants that were largely novice raters or linguistic laypeople (Sato & McNamara, 2019). In other words, these were individuals without training or extensive experience in language teaching or language testing. The vast majority of judgments about L2 speakers’ language ability are made in informal settings in society by linguistic laypeople, such as when applying for a loan at a bank or being interviewed for a job. By sampling participants from this population (i.e., general society, undergraduate students studying various degree programs in higher education), it was anticipated that these individuals would draw on their own intuitions and understanding of language when interacting with others when rating, thus providing inferences about how language ability may be evaluated by society at large. Understanding how linguistic laypeople integrate nonverbal behavior into their judgments of L2 ability can lead to a greater understanding of general processes of L2 speech perception. Linguistic laypeople or novice raters have been used to investigate communicative processes both in applied linguistics (Sato & McNamara, 2019) and language testing research (e.g., Isbell et al., 2024). Using naïve raters may have resulted in less reliable or harmonious scores, but this variance was anticipated and desirable in this study, as it best reflects the way listeners in society may process visual input.
I determined the sample size using both a power analysis simulation and a reading of existing literature. Past literature suggested that larger second-level cases (i.e., raters) than first-level cases (i.e., speech samples) would provide greater power (Hox et al., 2018; Westfall et al., 2014). I found that a stimulus set of 30 samples and a participant sample of at least 80 raters would have a power of .95 to detect regression coefficients of .2 for the nonverbal predictor variables, which I considered the smallest meaningful effect size. I over-recruited raters because I anticipated that some participants would need to be removed due to problematic ratings, such as careless responding, straightlining (i.e., marking all 7s), or acquiescence bias (i.e., marking socially desirable responses rather than their own personal views). I used Rasch person fit statistics, multivariate outlier estimation, and interrater correlations to detect less desirable ratings that could negatively impact the quality of the dataset, which resulted in the removal of 17 raters from the final dataset. Data cleaning procedures are reported in Supplement 1 in Burton (2023b).
Speech samples
The 30 speech samples I used, listed in Table 1, were Zoom recordings of operational IELTS (n.d.) speaking tests collected in Nakatsuhara et al. (2017). These speech samples were provided to me upon request from IELTS, and as proprietary material are not available in the OSF repository. The full dataset consisted of 46 files, and I used the procedure outlined in Supplement 2 in Burton (2023b) to select 30 samples for use in the study that had an even score distribution and range of expressiveness. The speakers in the speech samples were mostly female (23 females, 7 males) and all were from mainland China. Thus, the individuals in the speech samples were also controlled in terms of their demographic background, similar to the raters. The test takers’ scores that accompanied the speech samples were an average of the four IELTS (n.d.) criterion scores (i.e., subscores) rounded to the nearest half-band, based on the entire speaking test. The distribution of scores was even: eight speech samples were in the “low” group (3.5–4.5), 14 were “mid” (5–5.5), and eight were “high” (6–6.5). I selected and trimmed samples from the first half of the third section of the speaking test, which is a two-way discussion between a highly proficient English-speaking examiner and a test taker. In this section of the test, the examiner elaborates questions based on a previously discussed topic that allow the test taker to discuss abstract ideas and issues (IELTS, n.d.). I trimmed the samples to roughly 2 minutes (M = 2 m 11s, SD = 14s) so that raters would have to make quick impressions based on a small sample of language. Using truncated samples deviates from operational practice (in which the entire speaking test is evaluated), but this was a practical consideration in this study, as longer samples require a longer rating period and would have been unfeasible. 
The samples were named according to their test score ranking (rather than raw score), with Sample 1 (S01) being the weakest and Sample 30 (S30) tying for the strongest, as seen in Table 1.
Table 1. Speech samples.
Each sample featured the test taker in a standardized setting. The test taker sat in front of a laptop computer with a camera embedded in the top bezel. All test takers’ heads were fully visible, with only their lower body obscured by the framing of the video. Lighting conditions were consistently bright in all videos, and the test takers’ facial features were fully visible. Although there was a small viewing angle difference when individuals looked at the computer screen rather than the camera, the overall impact was similar to mutual gaze (e.g., Burton, 2023a). The videos were recorded such that video quality was sufficient for automated analysis, and none of the videos included technical issues or interruptions.
Instruments
Rating scales
I built a set of semantic differential scales to tap into raters’ impressions of language ability. Semantic differentials are generally single-word adjectives or short descriptions which are paired with antonyms (e.g., engaged/unengaged, anxious/at ease) set on an ordinal scale. Semantic differentials are beneficial for evaluating performances as they are simple to understand and generally do not require training (Snider & Osgood, 1969). The language features I chose were those typically associated with language proficiency: fluency, vocabulary, grammar, and comprehensibility. Comprehensibility was chosen rather than pronunciation because of ongoing research in this area (e.g., Trofimovich et al., 2021; Tsunemoto et al., 2022). I also believed the novice raters would understand comprehensibility more clearly than pronunciation, as characteristics of pronunciation (e.g., phonemic control, appropriate stress) may be unfamiliar to linguistic laypeople, and they may have resorted to judgments of accent. The wording of the scale is presented in Table 2. Raters were provided brief definitions of these terms in the instructions to the study, also provided in Table 2, and were instructed to think carefully about their meaning prior to starting the study.
Table 2. Semantic differential scale wording.
Each scale contained seven points. Although an even-numbered scale without a midpoint would have required raters to choose a scale direction (fluent/disfluent), I chose a scale with a midpoint to allow ratings that exhibited characteristics of both endpoints. Additionally, I chose a scale with seven points because of its desirable psychometric properties, as scales with 5 or fewer points may have attenuated precision (Simms et al., 2019). The polarity of the adjectives alternated so that for some scales the positive adjective was on the left, while for other scales, it was placed on the right. This was to prevent survey acquiescence bias, such as straightlining answers (all 7s) for all categories, as raters would have to carefully read each scale to know whether a 1 or a 7 was positive or negative. The scales also included a set of 10 affect judgments (e.g., confident/unconfident, anxious/at ease) for the raters to use after assigning language scores. The inclusion of affect scales was largely intended to encourage raters to watch and not only listen to the video samples, given that perceptions of affect are often based on facial behavior (Kappas et al., 2013). The analysis of affect scale results is beyond the scope of this paper, however, and I will focus only on the language outcomes in the analysis, though the full findings can be found in Burton (2023c). Prior to being used in the study, the scales were piloted with a separate set of raters and videos to verify they could be used meaningfully in the main study.
Rating platform
The scales were incorporated into an online rating platform I built using Qualtrics. An example of the platform is viewable in Supplement 3 in Burton (2023b). The platform included an introduction and description of the rating scales. Brief definitions were provided to clarify the rating scale categories of language ability, as seen in Table 2, but these were kept simple because it was desirable for participants to bring their own internal definitions of the terms to the study. Providing extended definitions may have introduced unnecessary confusion or difficulty with the task. A practice section immediately followed the introduction. Each participant viewed two videos, one of a higher-ability speaker and one of a lower-ability speaker. These practice videos were not illustrative of the lighting conditions of the main bank of videos, as I had recorded one video in a previous study. The raters were encouraged to consider the performance and rate the sample. Thereafter, raters received a description of the sample’s linguistic performance without reference to behavior in order to orient their attention toward general language performance. For example, after viewing the first practice video, the raters were shown the following text: “Although the speaker struggles to understand the question at first, overall her language is fairly strong once she begins speaking. She manages to communicate fairly effectively.” Raters were not provided feedback on their actual ratings or explicit benchmarked scores, as it was desirable for raters to maintain their own internal definitions of the scale categories.
The participants conducted the study on two different days (15 samples per day) separated by at least 24 hours. The 24-hour break was planned to reduce rater fatigue. The introduction and practice sections were available on both days of the study, but participants could skip the second practice. Participants rated each sample once in a random order. The speech sample stimuli were presented on separate pages from the rating scales, and the rating scales were presented only after the videos had finished. The samples and scales were separated to reduce distractions and to encourage participants to watch the video the entire time it was playing, as otherwise the raters may have looked through the rating scales while listening. The videos could not be paused, restarted, or downloaded. The videos were presented in the maximum size possible within Qualtrics to enhance the visual area of the videos. The rating study concluded with a follow-up survey not analyzed here but available in Supplement 3 in Burton (2023b).
iMotions
I analyzed the nonverbal behavior in the video speech samples using the facial expression analysis module within iMotions (Version 9.0; iMotions, 2017). iMotions is a behavioral analysis application that uses computer vision and machine learning algorithms to detect faces, automatically code facial expressions, and classify emotional states. It is able to measure head orientation, facial landmarks, action units, seven emotional states (joy, anger, surprise, fear, contempt, sadness, and disgust), and three omnibus measures of behavior: engagement, valence, and attention (iMotions, 2017). iMotions has been found to be more accurate than physiological methods of behavior detection such as facial electromyography (EMG; Kulke et al., 2020). Research has shown that automated facial expression analysis algorithms can provide measurements similar to human observations for prototypical expressions in posed environments with frontal face recording and sufficient lighting conditions (Beringer et al., 2019; Dupré et al., 2020; Kulke et al., 2020; Stöckli et al., 2018), though accuracy decreases in naturalistic conditions (Cross et al., 2023; Küntzler et al., 2021). Nonetheless, iMotions’ facial expression algorithm Affdex (Affectiva, n.d.) is regularly updated, and at the time of data collection in 2022, accuracy had improved over previous versions of the system (Affectiva, 2022). Although the videos were recorded in naturalistic conditions in this study, the test takers were presented facing the camera and in bright lighting conditions, which can improve the accuracy of behavior detection (iMotions, 2017).
I extracted engagement, valence, and attention for this study because each captured complex combinations of facial movements relating to features raters identified in the literature: expressiveness, positivity, and gaze direction. According to Affectiva (n.d.), engagement is a measure of overall facial muscle activation (expressiveness). Facial muscles contributing to this measure include eyebrow raising and furrowing, cheek raising, nose wrinkling, mouth movements, and chin raising. Valence, on the other hand, is a measure of the relationship between behaviors and positive or negative emotions. Valence is measured by smiling and cheek raising (positive valence) and brow raising/furrowing, nose wrinkling, mouth frowning, lip pressing, and chin raising (negative valence). Attention is a measure of gaze and head turns directed toward the stimulus source (the camera). Data output for these measures is in the form of a probability-based confidence score for each video frame. Engagement and attention are on scales of 0 (no behavior present) to 100 (behavior fully active), while valence is scaled from −100 (negative) to 100 (positive). For example, if a frame receives a value of 87 on engagement, the algorithm has classified this instance as highly likely to be expressive through facial muscle movements. The final output of each response is a table of probability measures for each frame of video analyzed.
Procedure
Following a recruitment survey that gathered demographic information, eligible participants were given access to the study. Participants indicated their consent and signed a non-disclosure agreement prepared by IELTS for the protection of proprietary data. The participants conducted the study in the location of their choosing, but they were asked to choose a quiet space. Twenty-four hours after completing the first day of the study, they were sent an e-mail with a link to the second day of the study. Each day of rating contained 15 randomized samples, and the days were counterbalanced to reduce order effects. At the end of the second day, the participants completed a follow-up survey and were compensated for their time. It took participants roughly 1 hour to complete each day of the study (Day 1: M = 61 m, SD = 18 m; Day 2: M = 62 m, SD = 20 m).
Statistical analysis
To analyze the outcomes, I first adjusted the polarity of all scales such that negative judgments (e.g., weak vocabulary, disfluent) were reordered with 1 being the lowest endpoint, and positive judgments (e.g., strong vocabulary, fluent) were reordered such that 7 was the highest endpoint. These four outcomes (fluency, vocabulary, grammar, and comprehensibility) were dependent variables in the study. I rescaled the IELTS scores (using the rescale function in R) to the same 1–7 scale as the proficiency judgments to be used as interaction terms. This scaling was only to improve the comparability and interpretability of the scores. I refer to these as base proficiency scores, as they were external and measured prior to this study.
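The two preprocessing steps described above (polarity adjustment and linear rescaling) can be illustrated with a minimal Python sketch. This is not the code used in the study, which relied on the rescale function in R. The function names are hypothetical, and the sketch assumes that the observed minimum and maximum IELTS scores of the samples (3.5 and 6.5) served as the endpoints of the original scale, which the text does not state explicitly.

```python
def reverse_code(score, low=1, high=7):
    """Flip a rating on a low..high scale so that the high end is always positive."""
    return low + high - score


def rescale(value, old_min, old_max, new_min=1.0, new_max=7.0):
    """Linearly map a value from [old_min, old_max] onto [new_min, new_max]."""
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)


# A rating of 2 on a scale whose positive adjective sat on the left becomes 6.
flipped = reverse_code(2)

# Assumption: the observed IELTS range (3.5 to 6.5) defines the old endpoints.
ielts_bands = [3.5, 4.5, 5.0, 5.5, 6.5]
base_proficiency = [rescale(band, 3.5, 6.5) for band in ielts_bands]
print(flipped, base_proficiency)
```

Under these assumptions, the lowest band in the dataset maps to 1, the highest to 7, and a band of 5.0 falls at the scale midpoint of 4.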
I used the three iMotions raw score averages as predictors in this study. Raw score averages were used because I was interested in the intensity of behavioral indices rather than changes relative to each sample. Raw scores are most appropriate when comparing individuals or groups (iMotions, 2017). I avoided thresholding output indices on either time or amplitude. Thresholding is used to determine the likelihood of the appearance of an individual’s behavior at a given confidence level, allowing researchers to investigate questions related to reaction time or behavior duration (iMotions, 2017). iMotions extracted 117,221 frames of data from the 30 speech samples at 30 frames per second, totaling 66 minutes of video. I averaged engagement, valence, and attention values for each speech sample for analysis. The means represented the overall strength of engagement, valence, or attention for each speech sample across the entire video.
However, aggregated mean scores only reflect part of the reality of behavior in naturalistic settings. Behavior may shift and change in communicative events according to the constraints of context. Indeed, Tsunemoto et al. (2022) noted that gaze shifting may have played a positive role in improving L2 comprehensibility. For these reasons, I decided to investigate behavioral variance in addition to mean behavior. This decision occurred after the study had been preregistered, making this an exploratory aspect of this study. In order to capture behavioral variance, I computed standard deviations for the three iMotions indices for each test taker. These standard deviations were thus a simple measure of relative shifts in behavior across each video.
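The aggregation of frame-level output into one mean and one standard deviation per speech sample can be sketched as follows. This is a minimal Python illustration rather than the procedure actually run: the frame values and sample identifiers are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_sample(frames):
    """Return (mean, SD) of one behavioral index across a sample's video frames."""
    # stdev() is the sample (n - 1) estimator; this choice is an assumption.
    return mean(frames), stdev(frames)


# Hypothetical frame-level attention values (0-100) for two samples.
frames_by_sample = {
    "S01": [90.0, 92.0, 94.0, 96.0],    # steady attention
    "S02": [60.0, 100.0, 60.0, 100.0],  # attention shifting back and forth
}

summary = {sid: summarize_sample(vals) for sid, vals in frames_by_sample.items()}
for sid, (m, sd) in summary.items():
    print(f"{sid}: mean = {m:.2f}, SD = {sd:.2f}")
```

In this toy example the second sample has the lower mean but the much larger standard deviation, which is the kind of shifting behavior that a mean alone would not capture.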
I used cumulative logit mixed effects models to test for the relationships amongst the iMotions behavioral indices and the language variables. I used the clmm function in the ordinal package (v. 2019.12-10) in R to account for random rater and sample variance. In each model, I entered variables by order of correlation with the dependent variable, creating four models including the null model. Each variable was entered one at a time.
All models used a logit link and flexible thresholds for the most accurate estimation of score probabilities. I also compared the four models with the interaction model of the three terms with base proficiency. I then selected the best fitting model based on comparisons using likelihood ratio tests. I also tested the final selected model against the same model with random effects removed to verify that random effects contributed meaningfully. Bonferroni corrections were applied to the four sets of analyses with a significance threshold of α = .0125. In all cases, the interaction models provided the best fit. Significant interactions were explored by entering the trichotomized grouping variable (see Table 1) as an interaction term, with the high-proficiency group as the reference group. I inspected these models for group differences, and explored these differences visually using scatterplots.
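The model comparison logic described above can be illustrated with a short Python sketch of a likelihood ratio test and the Bonferroni-adjusted threshold. This is not the code used in the study, which relied on R. The log-likelihood values are hypothetical, the familywise α of .05 is inferred from .0125 × 4, and the chi-square tail function uses a closed form that holds only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """Return (chi-square statistic, p value) for two nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf_even_df(stat, df)


# Bonferroni correction across the four sets of analyses (familywise .05 assumed).
alpha = 0.05 / 4

# Hypothetical log-likelihoods for a reduced model and an interaction model.
stat, p = likelihood_ratio_test(loglik_reduced=-5000.0, loglik_full=-4975.0, df=4)
print(f"chi2(4) = {stat:.2f}, p = {p:.2e}, significant = {p < alpha}")
```

The interaction model is preferred only when the p value falls below the corrected threshold of .0125 rather than the conventional .05.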
Most assumptions for the four models were met: The dependent variable was ordinal, the independent variables were continuous (iMotions scores), and there was no evidence of multicollinearity. The fourth assumption, that of proportional odds, was checked using Brant’s tests (Brant, 1990) with the brant package (v. 0.3-0) in R on models with random effects removed (polr models). This was necessary because no statistical software package can currently check this assumption for clmm models. Not all of the models supported the proportional odds assumption. However, this may not be problematic when estimating average odds ratios across samples/raters and when the aim of the study is not to predict discrete outcomes for individuals (Harrell, 2020). For this reason, I used cumulative logit models rather than comparable multinomial models, which would have hindered parsimony and interpretability. The data set and code sheet are available in the OSF repository as Supplements 4 and 5, respectively, in Burton (2023b).
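To make the proportional odds assumption concrete, the following Python sketch computes category probabilities for a 7-point scale under a cumulative logit model. It assumes the common parameterization logit P(Y ≤ j) = θ_j − η, where η is the linear predictor. The threshold values are hypothetical, and random effects are omitted for simplicity.

```python
import math


def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(thresholds, eta):
    """P(Y = j) for j = 1..J under logit P(Y <= j) = theta_j - eta."""
    cumulative = [logistic(theta - eta) for theta in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical thresholds: six cut points separate seven scale categories.
thresholds = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]
baseline = category_probabilities(thresholds, eta=0.0)
shifted = category_probabilities(thresholds, eta=1.0)
print([round(p, 3) for p in baseline])
print([round(p, 3) for p in shifted])
```

Under this structure, raising the linear predictor by one unit multiplies the odds of scoring at or below any category by the same factor, exp(−1), at every threshold. A Brant test asks whether that single-factor restriction is tenable for the data.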
Results
The participants used the full range of scale point categories when rating. Table 3 shows that the mean of each scale category was near the midpoint of 4, with the lowest scores in grammar (4.14) and the highest in comprehensibility (4.83). For interrater reliability, although alpha was relatively high for a rating scale without rater training or descriptors (.72–.85), the intraclass correlation coefficients (ICCs) showed that agreement amongst the raters was low to mid (.37–.58), as anticipated. Table 3 also shows polychoric correlations between the four scales, which were rather high (.67–.85), showing some overlap in construct coverage.
Table 3. Descriptive statistics.
Rating patterns broadly coincided with the base proficiency scores (rpoly = .63), though with notable differences. Figure 1 shows the candidate identification number on the y-axis, ranging from S1 (the lowest scoring on the prior proficiency test) to S30 (the highest scoring on the prior proficiency test). Mean scores and associated confidence intervals are presented for each sample. For each rated language outcome of fluency, grammar, vocabulary, and comprehensibility, there were samples perceived as substantially stronger (e.g., S09, S17) or weaker (e.g., S25, S29) than their original proficiency ranking. Others were ranked similarly to the prior proficiency test scores. There was more deviation in the scores from the original proficiency ranking in the mid- and high-proficiency groups than the low-proficiency group.

Figure 1. Distribution of scores by participant.
Descriptive data about the iMotions variables is available in Supplement 6 in Burton (2023b). Several patterns emerged from the data, summarized in Table 4. For one, engagement scores showed that test takers displayed a rather minimal amount of facial muscle activation as a whole (M = 27.60, scale ranging from 0 to 100), but there was a large amount of variance in expressiveness across samples (SD = 30.22). In contrast, valence (measured on a scale ranging from -100 to 100) had a mean that was near 0, indicating that test takers did not exhibit clear tendencies of positive or negative emotions as a whole, but there was substantial variance in the group (SD = 23.95). For attention, test takers largely directed their gaze and head turns toward the camera (M = 91.07, scale ranging from 0 to 100), but exhibited less variance doing so (SD = 13.15).
Table 4. Predictor means and polychoric correlations with language ability.
Note: *p < .0125, brackets indicate 95% confidence intervals of polychoric correlations.
The polychoric correlations listed in Table 4 show two main patterns. For one, fluency, vocabulary, and grammar correlated similarly with the predictor variables, with small, significant correlations (Plonsky & Oswald, 2014) with mean valence, the standard deviation of valence, and the standard deviation of attention. Comprehensibility showed a different correlational pattern. It also correlated with mean valence and the standard deviation of attention, but it did not correlate with the standard deviation of valence. In addition, it was the only outcome measure that correlated with engagement (both mean and SD).
Ordinal regression of behavioral means
Similar to the correlation analysis above, ordinal regression revealed a pattern whereby model estimates for fluency, vocabulary, and grammar were nearly identical, differing only in the strength of the regression coefficients and odds ratios. Because these were so similar, I report statistical tables for fluency, vocabulary, and grammar but describe only the fluency results in full in the text. Comprehensibility showed different patterns and is described separately.
Fluency, vocabulary, and grammar
For fluency, the best fitting model was the interaction model with base proficiency included, χ²(4) = 50.05, p < .001, shown in Table 5. This model also fit significantly better than the model with random effects removed, χ²(2) = 712, p < .001. The only significant predictor in this model was base proficiency, which had a sizeable relationship with fluency, β = 2.27, odds ratio = 9.66. This model, however, explained only minimal variance in the outcome, Nagelkerke’s pseudo R² = .02. Similar patterns were found for vocabulary and grammar in Tables 6 and 7. Base proficiency was the sole predictor of vocabulary (β = 1.90, odds ratio = 6.70) and grammar (β = 1.43, odds ratio = 4.19), explaining minimal variance in these models, Nagelkerke’s pseudo R² = .02.
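The link between the reported coefficients and odds ratios follows from the logit scale: an odds ratio is the exponentiated coefficient. The Python sketch below applies this to the base proficiency coefficients reported above. It is illustrative only, and because the published coefficients are rounded to two decimals, the recomputed values can differ slightly from the published odds ratios.

```python
import math


def odds_ratio(beta):
    """Convert a logit-scale coefficient to an odds ratio."""
    return math.exp(beta)


# Base proficiency coefficients as reported in the text (rounded to two decimals).
reported = {"fluency": 2.27, "vocabulary": 1.90, "grammar": 1.43}

for outcome, beta in reported.items():
    print(f"{outcome}: beta = {beta:.2f}, OR ~ {odds_ratio(beta):.2f}")
```

Read this way, the fluency coefficient implies that each one-point increase in rescaled base proficiency multiplies the odds of a higher fluency rating by a factor of roughly 9 to 10, holding the other terms constant.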
Table 5. Interactions between base proficiency and mean behavioral indices on fluency.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Table 6. Interactions between base proficiency and mean behavioral indices on vocabulary.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Table 7. Interactions between base proficiency and mean behavioral indices on grammar.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Comprehensibility
Comprehensibility showed a markedly different pattern. The interaction model again fit the data best, χ²(4) = 48.00, p < .001, and it also fit significantly better than the model with random effects removed, χ²(2) = 697.42, p < .001. In contrast to the other models, the interaction between mean valence and base proficiency was a significant predictor (see Table 8), β = −0.02, odds ratio = 0.98. Similar to the other models, this model explained minimal variance in the score outcome, Nagelkerke’s pseudo R² = .02.
Table 8. Interactions between base proficiency and mean behavioral indices on comprehensibility.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
A post hoc analysis using three proficiency groups (outlined in Table 1) as the interaction term showed that the estimate of the low-proficiency group in comparison with the high-proficiency group was significantly different for the group interaction with mean valence (β = 0.09, p = .015, odds ratio = 1.09), which is a small effect size. The mid-proficiency group’s estimate was not significantly different from that of the higher proficiency group (β = 0.06, p = .22, odds ratio = 1.06). Figure 2 visualizes the interaction effect. This figure and positive estimate show that the lower-scoring group’s comprehensibility scores corresponded positively with mean valence. In other words, displaying overall more positive behaviors corresponded with greater ease of comprehension in the lower-proficiency group. There was no noticeable effect of mean valence on the mid-proficiency group. The high-proficiency group, however, showed a negative correspondence between positive valence and comprehensibility scores, though this was not statistically different from that of the mid-proficiency group.

Figure 2. Interactions between mean valence and base proficiency on comprehensibility.
Ordinal regression of behavioral variance
Fluency, vocabulary, and grammar
Models were built with the standard deviations in the same method as with the means. As with the means, fluency, vocabulary, and grammar had similar results, and thus I report the tables and figures for all models but only describe fluency in full. The best fitting fluency model was the interaction model, χ²(4) = 53.48, p < .001, which also fit significantly better than the model with random effects removed, χ²(2) = 647, p < .001. This model, presented in Table 9, contrasted with the mean model in that the interaction of attention with base proficiency was significant, β = 0.02, p = .003, with a very small effect size (odds ratio = 1.02). The main effect of attention was also significant, β = −0.06, p = .01. This model explained minimal variance in the outcome, Nagelkerke’s pseudo R² = .03. Similar patterns were found for vocabulary and grammar in Tables 10 and 11. The interaction of attention with base proficiency was the sole predictor of vocabulary (β = 0.01, odds ratio = 1.01) and grammar (β = 0.01, odds ratio = 1.01), explaining minimal variance in these models (vocabulary R² = .02, grammar R² = .03).
Table 9. Interactions between base proficiency and behavioral index SDs on fluency.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Table 10. Interactions between base proficiency and behavioral index SDs on vocabulary.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Table 11. Interactions between base proficiency and behavioral index SDs on grammar.
Note: *p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
A post hoc analysis using three proficiency groups (outlined in Table 1) as the interaction term showed that the estimate of the low-proficiency group in comparison with the high-proficiency group was significantly different for fluency (β = −0.08, p = .007, odds ratio = 0.92), which is a small effect size. There were analogous findings for vocabulary (β = −0.06, p = .019, odds ratio = 0.94) and grammar (β = −0.06, p = .008, odds ratio = 0.95). The mid-proficiency group’s estimate was not significantly different from that of the higher proficiency group in fluency (β = −0.02, p = .44, odds ratio = 0.98), vocabulary (β = −0.0098, p = .70, odds ratio = 0.99), or grammar (β = −0.01, p = .49, odds ratio = 0.99). Figure 3 visualizes the interaction effect of attentional variance with base proficiency for fluency, vocabulary, and grammar. This figure shows that the lower-scoring group’s scores corresponded negatively with variance in attention. That is, changes between directed and undirected attention corresponded with lower perceptions of fluency, vocabulary, and grammar in the lower-proficiency group. Both the mid- and high-proficiency groups, however, showed a positive correspondence between attentional variance and these language outcomes.

Figure 3. Interactions between attention SD and base proficiency on fluency, vocabulary, and grammar ratings.
Comprehensibility
The best fitting model for comprehensibility was the interaction model, χ²(4) = 46.25, p < .001. This model, shown in Table 12, also fit significantly better than the model with random effects removed, χ²(2) = 692.65, p < .001. In this model, however, none of the predictors were significant. This model explained minimal variance in the outcome, Nagelkerke’s pseudo R² = .02.
Table 12. Interactions between base proficiency and behavioral index SDs on comprehensibility.
Note: p < .0125. Prof = base proficiency. OR = odds ratio. 95% CI = confidence interval of odds ratio.
Discussion
The aim of this project was to investigate whether automated measurements of nonverbal behavior exhibit a relationship with linguistic laypeople’s perceptions of L2 ability. L1-English, American-born undergraduate students watched and listened to short recordings of L2 English test takers from China taking a remote form of the IELTS speaking test. The naïve raters scored each sample on fluency, vocabulary, grammar, and comprehensibility. I measured nonverbal behavior using iMotions, which produced indices of engagement (facial muscle activation or expressiveness), valence (positive/negative directionality), and attention (mutual gaze and head orientation toward the camera), which I then used to model effects on rated language outcomes. I used the rescaled IELTS speaking test scores as an interaction term in these models. I investigated the impact of average values of the behavioral measures for each sample, but I also conducted an exploratory analysis using behavioral variance through standard deviations. The findings showed that variance in attention and overall positive/negative facial expressions subtly influenced raters’ perceptions of individuals’ L2 ability, but the effects of nonverbal behavior were not uniform across all ability levels.
Does nonverbal behavior predict rated outcomes of language ability?
Although correlations suggested weak relationships between mean valence and engagement and the language ratings, the regression models showed that nonverbal behavior did not have a uniform impact on language ability ratings in this sample. Valence, for instance, was the only behavior that correlated with fluency, vocabulary, and grammar (rpoly = .07–.12), while both valence and engagement correlated positively with comprehensibility (.18 and .11, respectively). Attention did not correlate with any of the outcomes, contrary to expectations. In the regression analyses, average attention, engagement, and valence were not significant main effects of fluency, vocabulary, grammar, or comprehensibility. In other words, nonverbal behavior did not impact language ratings across the whole sample. Thus, the first hypothesis, that average attention and expressiveness would predict proficiency outcomes across all test takers, was refuted.
The finding that average behavioral measures did not show a relationship with proficiency outcomes across all test takers is not supported by the literature. Studies have attested that more open, outgoing, and expressive test takers may be seen as more proficient overall (Jenkins & Parra, 2003; May, 2011; Neu, 1990) and may display greater fluency (Kim et al., 2024; Tsunemoto et al., 2022). This is also contrary to Trofimovich et al. (2021), who found a partial positive correlation between positive affective behavior (e.g., smiling) and comprehensibility across their sample. Likewise, attention (as mutual gaze) relates to positive impressions of test takers, while averted gaze exerts a negative influence (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2011; Nakatsuhara et al., 2021; Sato & McNamara, 2019). Nonetheless, apart from Kim et al. (2024) and Tsunemoto et al. (2022), these studies were based on rater reports. What raters notice and what actually impacts their ratings may differ slightly. It may also be the case that using ensemble measures of behavior (e.g., attention includes gaze and head turns) rather than discrete measures (e.g., gaze direction) may obscure effects. This could explain the effects found in Kim et al. (2024) and Tsunemoto et al. (2022), where discrete counts of eyebrow movement, smiling, and gaze direction exerted measurable effects on fluency and comprehensibility. This highlights a need to isolate specific behaviors, including a broader range of gesture and body movements, for future research. However, as I will discuss next, it is also possible that nonverbal behavior affects the perception of language ability in different directions depending on underlying ability profiles, essentially canceling out main effects.
Do nonverbal behaviors impact rated outcomes differentially depending on the test takers’ base proficiency (prior standardized speaking test scores)?
In the four regression models for fluency, vocabulary, grammar, and comprehensibility, the models including an interaction term of base proficiency (the original IELTS scores awarded to the test takers in the samples) fit the data the best. Nonetheless, of the four models, the only significant interaction effect was between mean valence and base proficiency on comprehensibility ratings. Mean behavioral interaction terms did not predict fluency, grammar, and vocabulary outcomes, even though correlations with valence were positive. These findings thus did not support the a priori hypothesis to this research question. The findings were also contrary to Jenkins and Parra (2003), who posited that borderline passing test takers may benefit more from being expressive than stronger test takers. Instead, the post hoc analysis revealed that lower proficiency test takers exhibiting more positive behaviors (rather than merely being expressive) experienced a small benefit in their comprehensibility ratings. This indicates that seeing more positive behaviors may have helped raters understand the test takers more clearly. It was unclear whether the mid- or high-proficiency groups’ scores differed across valence levels, as these groups were not significantly distinct in their outcomes. That being said, it is interesting that the high-proficiency group appeared to have a negative relationship with mean valence. This pattern may have been influenced by an outlying proficient yet anxious test taker, and as such I hesitate to explain this particular effect further. The effect size of these relationships was very small, though, explaining only 2% of the variance. Thus, any benefit of positive behaviors on ease of understanding is subtle, at least as behaviors are defined and measured by iMotions.
The fact that the mid- and high-proficiency groups did not show a benefit of nonverbal behavior on their comprehensibility measures, as well as fluency, vocabulary, and grammar, is likely because their language skills were already sufficient to be understood clearly, as hypothesized by Jenkins and Parra (2003). Regarding the low-proficiency group, although Jenkins and Parra (2003) found that borderline candidates could benefit from overall expressiveness, which I did not find in this dataset, it is possible that the communicative-forward expressive behaviors the authors discussed may have demonstrated positive valence. If this were the case, it may help to explain the finding that valence predicted comprehensibility ratings in this group. Rather than expressiveness alone, comprehensibility may be more susceptible to positive affect in low proficiency speakers because it shows engagement and interest in the conversation. This could stimulate corresponding interest in the listener, leading to a willingness to listen more closely and thus facilitating understanding. Negative affect might have the opposite effect, causing listeners to disengage from speakers. These findings would be related, but not identical, to those of Nagle et al. (2022), who found that lower anxiety and greater collaborativeness served as predictors of comprehensibility scores. Nonetheless, in their study, those effects were consistent across the sample with speakers of different ability levels.
Relationship between variance in behavior and language ability
In addition to the preregistered hypotheses, I also explored whether behavioral variance—measured through standard deviations—would impact proficiency outcomes. I again found that the models with interaction terms fit the data best. In these models, comprehensibility outcomes were not predicted by main effects or interaction effects of variance in valence, engagement, or attention. However, the interaction between attentional variance (the amount of shifting between attention toward and away from the interlocutor) and base proficiency was significant for fluency, vocabulary, and grammar. A post hoc analysis of these effects revealed that lower proficiency and mid/higher proficiency candidates showed distinct patterns. A greater amount of shifting attention related to lower scores in the lower-proficiency group, while the opposite was true for the mid- and higher-proficiency groups. That is to say, as lower proficiency test takers more frequently directed their attention back and forth to the interlocutor, they were seen as weaker in fluency, vocabulary, and grammar. On the other hand, mid and higher proficiency candidates benefited from similar attentional patterns. Effect sizes, nonetheless, were again small.
While this finding may not be immediately intuitive, it is important to consider the various roles of gaze, which is a major component of attentional focus. Gaze is a complex behavior that may change according to various cognitive and affective states, social cues, and pragmatic needs. Shifting between mutual and averted gaze is an important aspect of interactional moves in speech, with proficient speakers showing uptake of information and initiation of turns with the breaking of gaze, and gaze returning to the interactant when turns are complete (Goodwin, 1980). Less proficient speakers may break gaze when questions are more difficult (and perhaps not understood) as a way to wrangle additional cognitive resources (Burton, 2023a; Doherty-Sneddon & Phelps, 2005). In language testing contexts, less proficient speakers have been found to use ensembles of behavior (in particular, gestures) that are more frequent, more irrelevant to the content of their utterances, and narrower in range, while more proficient speakers may use more integrated verbal-nonverbal utterances (Gan & Davison, 2011). Irrelevant, non-target-like gaze patterns may also become salient and informative to raters, thus indicating a speaker with lower language proficiency.
In this study, it is possible that stronger speakers who varied their attention in an integrated way with their utterances appeared even more proficient. Varied attention, then, added to their overall impression of language ability, as they were able to use attention as a tool at their disposal to manage the interaction with the examiner, while the opposite was true for the group with lower proficiency. These differential effects, however, largely go against previous findings that averted gaze has a uniformly negative impact on raters (Choi, 2022; Ducasse & Brown, 2009; Jenkins & Parra, 2003; May, 2011; Nakatsuhara et al., 2021; Sato & McNamara, 2019). Thus, these findings suggest that attention has a far more complex relationship with L2 speaking test scores than hypothesized: The amount of focused attention may not be as important as how frequently or where breaks in attention occur. More research is needed to confirm this finding, especially research using eye-tracking tools.
Implications
This study has shown that positive behaviors (such as smiling) and shifts in attention may subtly influence novice raters’ perceptions of L2 speech. Either explicitly or implicitly, listeners pick up on visual cues that ultimately factor into the decisions they make, which may reflect the way individuals in society perceive L2 speakers. Had these interactions occurred in real world settings, one could expect that lower proficiency speakers shifting their attention less and showing a more friendly demeanor may be somewhat more likely to be understood and thus possibly achieve goals related to their language use. The question remains as to whether these effects would be present in operational test settings where trained raters, rather than linguistic laypeople, use standardized rubrics to score tests. Trained raters are likely to focus more closely on language-related features and to score more consistently than linguistic laypeople. They use rubrics that detail a progression of language features generally without reference to nonverbal behavior. Whether behavior would influence trained raters’ scores is currently unknown, but given that audiovisual test input tends to be scored higher than audio-only input in operational settings (Carey & Szocs, 2024; Choi, 2022; Nakatsuhara et al., 2021), it is likely that nonverbal behavior still plays a role in trained raters’ perceptions of language ability and may influence score outcomes.
Even if this score variance appears in operational settings, another question that this study raises is whether nonverbal behavior should be accounted for in the test construct. Variance of the size reported here is unlikely to lead to major changes in final scores, especially when using an ordinal scale that lacks the granularity to reflect minor differences in language ability. However, for candidates with scores very close to score boundaries, raters may rely on external criteria to help consolidate their decisions (Lumley, 2002). This was indeed the case in Jenkins and Parra (2003). In terms of fairness and equity, any change in language test scores that is unattributable to the construct (in this case, language without the contributions of nonverbal behavior) deserves attention. It may mean that there is either construct underrepresentation or construct-irrelevant variance, depending on how the construct is formulated. Underrepresenting the test construct can be problematic if it disadvantages test takers by not accounting for their repertoires of communication that go beyond those that are verbal only. Underrepresentation may also be a problem from an ontological standpoint, as it implies a logocentric view of language that fails to capture the full panorama of communication (Mondada, 2016). On the other hand, nonverbal behavior that is attributable to personality traits or underlying neurodiverse backgrounds rather than language proficiency could be harmful to integrate into a construct; at the same time, given that raters almost certainly integrate nonverbal behavior into their internalized, idiosyncratic rating criteria, there is an additional risk in not addressing behavior if it leads to unfair biases. Furthermore, there are many situations in which audio-only communication is common, such as telephone calls, so careful articulation of the context in which communication occurs is necessary.
It may be the case that nonverbal behavior could be included in rating scales for certain test purposes, or alternatively as part of a rater training program to reduce bias, but research will be needed to ascertain whether this is feasible for raters to attend to without introducing unnecessary complexity. Although the small effects found in this study likely do not support a full revision of test constructs and rubrics to include nonverbal criteria, they will hopefully serve as an impetus for further research on the impact of behavior on test scores and possible interventions through rater training.
Limitations
This study comes with a number of limitations. Nationality and L1 background were controlled with the participants and samples in this study, so these findings may not generalize outside of these boundaries. As one anonymous reviewer pointed out, even within these backgrounds there may be variance among, for example, speakers from the north or south of the USA and China. Given differences in cultural norms surrounding the interpretation of behavior (Hall et al., 2019; Matsumoto & Hwang, 2012), certain behaviors may be perceived differently depending on the backgrounds of the speakers and interlocutors involved. Another limitation is that this study used a relatively small sample of performances, which, though diverse in proficiency levels, would be optimally expanded to include different cultures, L1s, and ranges of behavior. These larger datasets would provide more insight into the complex nature of behavior and speech perception.
There were methodological limitations of this study as well. Regarding the software used, which automatically generated behavioral data, validation evidence is still relatively scarce. Although studies have shown high accuracy in detecting behaviors in posed settings (e.g., Beringer et al., 2019; Dupré et al., 2020; Kulke et al., 2020; Stöckli et al., 2018), there are ongoing questions of the accuracy of behavior detection in naturalistic settings as these algorithms improve. There are also questions about training sets that are used, as these could introduce bias in the accuracy of behavior detection if not constructed carefully (Cross et al., 2023). However, as noted before, Affectiva (2022) countered many of these arguments by demonstrating that the system is regularly updated to improve its classification accuracy and reduce bias. Another limitation is that other behaviors beyond facial expressions, such as gestures, posture, head nods, and shoulder movements, are salient to raters as well (Ducasse & Brown, 2009; Gan & Davison, 2011; Jenkins & Parra, 2003; May, 2009, 2011; Neu, 1990; Sato & McNamara, 2019), and can add important information that aids comprehensibility (Tsunemoto et al., 2022). All of these behaviors were visible to some degree in the sample videos, highlighting the need to explore a broader range of sources of score variance.
In terms of the study design, this study used short 2-minute samples of speech as the basis of the language ability judgments, which may be too short to form an accurate impression of language ability. Indeed, the base proficiency scores were based on standardized proficiency test performances of approximately 14 minutes each. It is unknown whether the impact of behavior would be sustained across longer interactions with speakers. The use of semantic differentials, while useful for intuition-based research, can also be a limitation because raters are expected to draw from their own internal definitions of terms used. In this particular case, raters may have used broad or narrow definitions of fluency (Lennon, 1990), for example, or considered comprehensibility the amount of test takers’ listening comprehension, despite the definitions I provided in the instructions. Future research could improve the generalizability of these findings to operational rating sessions by using trained raters, rating rubrics, and longer samples.
Conclusion
Non-linguistic, nonverbal elements of communication have been an “elephant in the room” for decades, as testing practitioners witness these effects in operational settings, but the score impact has largely been unknown or disregarded as “noise.” This study has shown that nonverbal behavior exerts an effect, albeit a small one, on perceived speaking ability. It is unknown whether trained raters in operational settings would exhibit the same rating patterns, but these results may provide some explanatory evidence for score differences between audio-only and audio-with-video rating modalities (Choi, 2022; Nakatsuhara et al., 2021). Although changes in behavior are unlikely to cause major score differences when language is the basis of rating, even small changes in scores can have major consequences for test takers. For this reason, studies of this phenomenon are valuable for language testing and second language acquisition research and practice.
The findings in this study should be understood as preliminary. More research is needed to replicate these findings, to triangulate them with other methods and data sources, and to extend them using broader participant pools and contexts. As the availability and efficacy of technology make the study of nonverbal phenomena more feasible, it will be possible to develop a more detailed and nuanced understanding of L2 ability. It is clear that the nature of L2 ability encapsulates more than just linguistic elements, and, as language testers, it is our responsibility to determine what these elements are and to control or account for them in our test constructs whenever possible.
Supplemental Material
Supplemental material, sj-pdf-1-ltj-10.1177_02655322241255709 for Evaluating the impact of nonverbal behavior on language ability ratings by J. Dylan Burton in Language Testing
Footnotes
Acknowledgements
The author would like to thank the guest editors Dan Isbell and Benjamin Kremmel, the anonymous reviewers, and editors Talia Isaacs and Xun Yan for their comments and feedback, which have improved the manuscript over its initial version. The author would also like to thank Paula Winke, India Plough, Aline Godfroid, Koen Van Gorp, and Ryan Bowles for their support throughout this project from concept to completion.
Declaration of conflicting interests
The author declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The author served as editorial assistant for Language Testing journal when this paper was submitted, but he finished his tenure at the journal prior to peer review. He additionally declared having worked for the British Council and IELTS from 2015 to 2019.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study, which was conducted as part of the author’s doctoral thesis at Michigan State University, was funded by a British Council Assessment Research Award, a Duolingo Doctoral Dissertation Award, and a TIRF Doctoral Dissertation Grant.
Open practice
This article has received badges for Open Data and Preregistration.
Supplemental material for this article is available online.
An OASIS (accessible) summary of this article authored by Dylan Burton is available at https://oasis-database.org/concern/summaries/2z10wr45g?locale=en. In addition, a video abstract is available on the Language Testing YouTube channel.
Data availability statement
Materials, data, code, and the preregistration are available in the OSF repository at https://osf.io/sk726/ (Burton, 2023b). The speech samples used in this study are not available from the author as these are proprietary IELTS data.