Abstract
As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance, as compared to human raters, across several dimensions, for example, on individual items or based on subgroups of test takers. In addition, there is a need in testing organizations to establish rigorous procedures for monitoring the performance of both human and automated scoring processes during operational administrations. This paper provides an overview of the automated speech scoring system SpeechRaterSM and how to use charts and evaluation statistics to monitor and evaluate automated scores and human rater scores of spoken constructed responses.
Introduction
Language testing organizations in the United States must routinely deal with large populations, especially for certain Asian, European, and Middle-Eastern countries. Large populations are certainly not exclusive to language testing, but constructed response (CR) item scoring, including essay scoring and spoken response scoring, adds a further complication. Human scoring has its limitations, such as severity/leniency, scale shrinkage, inconsistency, halo effect, and rater drift (Engelhard, 1994, 2002). Without careful monitoring (Wang & Yao, 2013), the human rater effects may substantially increase the bias in students’ final scores. Human scoring is also very labor intensive, time consuming, and expensive (Zhang, 2013). The importance of these language tests for relatively high-stakes decisions places a lot of pressure on the entire system to ensure accurate scoring and consistent ratings.
Automated scoring capabilities such as e-rater® and SpeechRaterSM have been developed and have the potential to provide solutions to some of the obvious shortcomings in human scoring (e.g., rater inconsistency, rater drift, and inefficiency). Bennett and Bejar (1998) indicated that automated scoring procedures allow for the scoring rules to be applied consistently. Automated scoring has some advantages including “fast scoring, constant availability of scoring, lower per unit costs, greater score consistency, reduced coordination efforts for human raters, and potential for a degree of performance specific feedback” (Ramineni, Trapani, Williamson, Davey, & Bridgeman, 2012). Therefore, some operational programs have already started using automated scoring in combination with human scorers. The decision of these programs to use automated scoring was based on a number of research studies (Attali, 2007; Attali, Bridgeman, & Trapani, 2010; Attali & Burstein, 2006; Burstein & Chodorow, 1999; Chodorow & Burstein, 2004; Ramineni et al., 2012; Wang & von Davier, 2014).
SpeechRater is an automated scoring engine developed at Educational Testing Service (ETS) that has been used for a practice program since 2006. It consists of an automatic speech recognition (ASR) system, feature computation modules, and a multiple regression scoring model to predict scores for each spoken response (Zechner, Higgins, Xi, & Williamson, 2009).
The speaking construct that SpeechRater intends to measure is related to the notion of “communicative competence,” as described by Bachman (1990) and Bachman and Palmer (1996). The construct is operationalized as a set of rubric dimensions (used for human scoring of spoken responses) that cover various aspects of spoken proficiency, including fluency; pronunciation; prosody; vocabulary range and sophistication; grammatical accuracy and complexity; content; and aspects of discourse.
In recent years, the coverage of the speaking construct has been substantially extended from its original focus on fluency, pronunciation, and prosody by adding features related to vocabulary, grammar, and content, among others (Chen & Zechner, 2011; Xie, Evanini, & Zechner, 2012; Yoon & Bhat, 2012; Yoon, Bhat, & Zechner, 2012).
There are a few studies on effective quality control procedures in these types of language testing settings (Bejar, 2011; Wang & von Davier, 2010; Williamson, Xi, & Breyer, 2012), including one related to human ratings and automated essay evaluations (Bridgeman, 2013). Wang and von Davier (2010) have proposed a set of statistics and a framework (examinee, test, prompt, and rater level) to monitor the quality of CR scoring. Bejar (2011) has provided a quality control and assurance framework for automated scoring. Williamson et al. (2012) have provided a framework for automated scoring evaluation and use of automated scoring and guidelines for implementation and maintenance in the context of constantly evolving technologies. Bridgeman (2013) summarized some of the procedures for monitoring and evaluating the quality of essay ratings using both human and automatic scoring. Some researchers (Lee & von Davier, 2013; Luecht, 2010) have proposed using quality control techniques to monitor scoring, equating, and reporting of test scores. Lee and von Davier (2013) and Bejar (2011) also recommended using quality checking methods from other disciplines to monitor data routinely.
The main focus of the present research is to investigate whether human rater scores and the upgraded SpeechRater engine scores are comparable for each speaking item and administration. We used multiple methods, including quality control procedures such as traditional item analysis, agreement statistics, and graphical techniques, to address this main research question. We also looked at scoring differences across test takers’ native language groups, since fairness across subgroups of the testing population is an important consideration when deploying any assessment. The deployment of automated scoring technology needs to be done with fairness in mind, that is, treating test takers of different subgroups in the same way.
Research questions
The major research question in this validity study is whether the newly upgraded SpeechRater engine produces scores that are comparable to human raters. The current study targeted the following three specific research questions:
Are the ratings from human raters and SpeechRater consistent in severity and variability?
Can SpeechRater be used to identify human raters who are very strict or very lenient?
Do SpeechRater scores of different language groups differ in the same way as the scores assigned by human raters?
General considerations on automated speech scoring
This section provides some general background for the research presented in our paper. We will first compare and contrast automated scoring with human scoring of constructed responses in general, and of spoken responses in particular, and then provide a brief overview on the history of automated speech scoring.
Automated versus human scoring of constructed response items
Constructed responses (CRs) are typically elicited from test takers in order to provide evidence of a certain proficiency (e.g., being able to write a concise essay on a given topic, or to summarize a video lecture by using speech). When human raters assign scores to such CRs, they usually follow a pre-defined rubric, a set of band descriptors for each score level, indicating the typical characteristics of a CR for a particular score along several dimensions of a construct. For example, a spoken response to a prompt asking the test taker to describe their last summer vacation may be evaluated based on dimensions of speech such as fluency, pronunciation, prosody, vocabulary usage, grammatical expression, and content. Human raters need to be trained and calibrated to ensure they apply the rubric consistently across a large set of CRs in an assessment. However, in practice human ratings are not perfectly consistent, as discussed above.
Automated scoring systems usually identify various objectively measurable aspects of CRs, such as the rate of speech or the correctness of grammatical expressions in a spoken response, and then compute a score for a given CR by means of a weighted combination of these features, for example, by using a linear regression model that is trained on human-scored data. To the extent that the features computed by an automated scoring system are good representations of the dimensions of the construct that is associated with the response, such a scoring system is able to generate substantively meaningful scores. However, there has also been criticism that human raters are used as a “gold standard” to train automated scoring systems, even though it is known that they are far from perfect (Bennett & Bejar, 1998).
Obvious advantages of automated scoring systems are their perfect consistency, usually lower cost and faster scoring time, as well as the fact that it can be explained in detail what such a system is measuring, whereas this is much less obvious when using human raters for CR scoring.
Brief overview on the history of automated speech scoring
As speech recognition technology made substantial advances in the 1980s, researchers started to consider whether and to what extent this new technology could be used to evaluate the English proficiency of non-native speakers. The earliest systems focused on aspects of pronunciation and fluency (e.g., Bernstein et al., 1990; Cucchiarini et al., 1997a, 1997b, 2000a, 2000b; Franco et al., 2000b), and currently, the detection of pronunciation errors (e.g., for the purpose of language learning and tutoring systems) is still predominant in the field (e.g., EduSpeak, Franco et al., 2000a, 2010). Subsequently, this technology was used in various spoken language assessments with the emphasis on low-level aspects of speech, such as speaking rate, flow, hesitation, and pronunciation (Bernstein, 1999; Bernstein et al., 2000; Cucchiarini et al., 2002).
However, using only these types of low-level features that address delivery aspects of speech (such as fluency and pronunciation) does not capture the entire range of linguistic expression and representation that human raters will expect in spontaneous speech, as in the item responses in this study. These other dimensions include vocabulary range and sophistication, grammar accuracy and complexity, content appropriateness, progression and flow of ideas (discourse), and so on. The extraction of such higher-level language features has posed a challenge because (1) errors can be generated by the ASR system, and (2) it is difficult to devise natural language processing technologies to extract meaningful and accurate features, given the spontaneous nature and brevity of speech samples and the errors test takers may make in grammar or vocabulary choice.
Since the early 2000s, research and development work has been undertaken to automatically score not only predictable speech, but also more open-ended, spontaneous speech (Zechner et al., 2009). The automated speech scoring engine SpeechRater computes features in many diverse areas of the speaking construct, including fluency, pronunciation, prosody, vocabulary diversity, grammatical accuracy, and complexity, as well as content.
Method
Description of the data
The speaking section of the English language assessment used in this study elicits a total of 5.5 minutes of speech per candidate: two independent items that ask test takers to talk for 45 seconds on a familiar topic (e.g., “Describe a person that you admire.”) and four integrated items where reading and/or listening stimuli are presented first, and then the test taker has one minute each to respond to a prompt that is based on these stimuli.
Each response to a speaking item is scored holistically by a single trained human rater on a 4-point discrete scale of 1–4, with “4” indicating the highest proficiency level and “1” the lowest. The scores are assigned based on rubrics, one each for independent and integrated items. The rubrics describe the aspects of the speaking construct that are deemed most relevant for determining the speaking proficiency of test takers and thus guide human raters in their scoring decisions. Each score level has a description of prototypical observed speaking behavior in three main areas of spoken language: delivery (e.g., fluency and pronunciation), language use (vocabulary and grammar aspects), and topic development (e.g., progression of ideas and content relevance). Human raters usually get “batches” of responses for a particular prompt (rather than scoring, e.g., all the responses of one candidate). In addition, a random sample of about 10% of responses in each administration is scored by a second human rater for reliability control purposes. If the two scores disagree by more than one point, a third rater is asked to adjudicate the score. Finally, the six-item scores are aggregated and scaled for score reporting purposes.
Data were drawn from 10 administrations involving 110 countries in 2012–2013. Among the 10 administrations, half of them were mainly from the Western hemisphere and the other half were mainly from the Eastern hemisphere. We randomly sampled 1,100 test takers per administration. The speaking section of the English language assessment consists of six items. This yields a total of 10 × 1,100 × 6 = 66,000 responses that were scored by the SpeechRater engine. We pulled the first human rater scores (H1-rater), including second human rater scores (H2-rater, if available), from a data repository. (Note that “H1” and “H2” are logical labels for human raters; in actuality, “H1” scores and “H2” scores comprise scores from a large number of physical human raters.) As stated above, H2-rater scores were only available for 10% of the data, which is a random sample from the administrations selected for reliability purposes.
During the operational cycle, all human raters (both H1-rater and H2-rater) participated in a standardized training process before they were allowed to rate the speaking items. In this study, we focused on the comparison of the item scores between the H1-rater and SpeechRater. The H2-rater was from the same rater pool as the H1-rater, so there should not be any systematic differences between the H1-rater and the H2-rater. We also made comparisons between the scores assigned by the H1-rater and the H2-rater for the 10% reliability sample.
In addition to the main data set used for this study (66,000 spoken responses), we used 10,000 spoken responses to items in other forms of the same assessment to estimate the parameters of the linear regression model used by SpeechRater. A separate data set of 52,200 responses from the same assessment was used for training the parameters of the ASR system.
The SpeechRater system for scoring spoken responses
The SpeechRater system consists of the following four major components: (1) an ASR system that converts the test taker’s response into a sequence of hypothesized words; (2) a component that computes a set of features related to the Delivery and Language Use constructs based on the ASR output and the speech signal; (3) a filtering model that identifies responses that should not be scored due to construct irrelevance or technical issues; and (4) a linear regression scoring model trained on a set of human-scored spoken responses.
To build the ASR component of SpeechRater, we used a large data set of 52,200 English language assessment responses, consisting of more than 800 hours of speech. This data set was used to train the acoustic model and language model used by the state-of-the-art ASR system licensed from an external vendor. This ASR system achieved a word error rate of around 30% on an independent English language assessment test set. For building the scoring model, we used 10,000 spoken responses (all double scored by human raters) from an English language assessment (“training set”). For evaluating the scoring model, we used 66,000 responses from the same English language assessment, but used different administrations and different test takers (“test set”). This corresponds to the data set described above.
Based on the training set, we selected a set of 13 features representing the English language assessment speaking construct to a large extent, with the exception of the sub-construct of “topic development.” Feature selection criteria included the correlation with human rater scores, normality (determined via Q–Q plots), inter-correlation between selected features, and construct representation. The feature set includes features measuring fluency (e.g., rate of speech; presence of filled pauses, repetitions, and repairs; distribution of pauses), pronunciation accuracy, prosody (distribution of stressed syllables), grammar accuracy, and diversity of vocabulary.
Item analyses
Certain standards have been used to guide the analyses of data for the building and evaluation of automated scoring models (Williamson, Xi, & Breyer, 2012). Classical test theory item analysis statistics and some graphics such as box plots and Shewhart charts were applied as part of the monitoring procedures of both human raters and SpeechRater. These item statistics provide general indications of item quality and possible item development problems and can further be compared across raters or administrations to help indicate potential scoring discrepancies. This study uses the following statistics to address the research questions: (1) mean differences and SD ratio; (2) standardized mean difference; (3) quadratic weighted kappa; (4) Pearson correlation; and (5) human rater bias. For computing kappa statistics, the raw SpeechRater scores were first truncated and rounded for comparison against the integer human scores. For other statistics, truncated SpeechRater scores without rounding were used for comparison with human scores.
Table 1 displays a summary of the flagging criteria and conditions for evaluating SpeechRater model performance. The criteria used were similar to those recommended by Williamson et al. (2012) and Wang and von Davier (2014). Williamson et al. (2012) recommended that the quadratic weighted kappa and product–moment correlation between automated and human scoring must be at least 0.70 (rounded normally) with the underlying rationale that approximately half of the variance in human scores is accounted for by e-rater. They also recommended that the standardized mean score difference between the human scores and the automated scores cannot exceed 0.15. This criterion is applied to avoid differential scaling between the automated scoring and human scoring. By following the same logic, we also used this criterion to flag items in terms of the mean difference between human rater scores and SpeechRater. In terms of human rater bias, we applied 0.30 as the cut-off value by following Wang and von Davier’s (2014) recommendation in their e-rater and human rater comparison study. In their study, they used both a Shewhart chart and the 0.30 rule to identify outlier human raters against e-rater. Therefore, this rule can be used in our study to detect outlier human raters against SpeechRater as well.
Flagging criterion and conditions for SpeechRater evaluation.
See the bias formula in the “Human rater bias” subsection. The 0.30 rule was proposed and used in Wang and von Davier (2014).
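The flagging rules summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds (0.70, 0.15, 0.30) come from the text, while the function names, argument names, and the decision to compare unrounded values are hypothetical choices made for the example.

```python
# Thresholds quoted in the text; all names below are illustrative only.
QWK_MIN = 0.70        # minimum quadratic weighted kappa
CORR_MIN = 0.70       # minimum Pearson correlation
SMD_MAX = 0.15        # maximum absolute standardized mean difference
MEAN_DIFF_MAX = 0.15  # maximum absolute mean difference
BIAS_CUTOFF = 0.30    # absolute rater bias at or above this is flagged


def flag_item(qwk, corr, smd, mean_diff):
    """Return the names of the criteria that an item or administration fails."""
    flags = []
    if qwk < QWK_MIN:
        flags.append("qwk")
    if corr < CORR_MIN:
        flags.append("correlation")
    if abs(smd) > SMD_MAX:
        flags.append("smd")
    if abs(mean_diff) > MEAN_DIFF_MAX:
        flags.append("mean_diff")
    return flags


def flag_rater(bias):
    """Return True when a rater's absolute bias reaches the 0.30 cut-off."""
    return abs(bias) >= BIAS_CUTOFF
```

For example, `flag_item(0.69, 0.81, 0.10, 0.05)` reports only the kappa criterion. The text applies the 0.70 thresholds to values "rounded normally", so an operational version would round before comparing.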
Results
Research question 1
To address the first research question, we compared human raters and SpeechRaterSM to see if they differ in terms of severity and variability using the following analyses including box plots, mean difference/SD ratio, standardized mean difference, correlation, kappa, and Shewhart charts.
Box plots
Figure 1 shows the box plots for the overall mean scores of the H1-rater, H2-rater, and SpeechRater for the six speaking items across 10 administrations. The box plot provides a useful depiction of a moderate to large distribution of numbers. The box or “fence” captures the “interquartile range” representing the center-most 50% of the distribution of values (i.e., the scores ranging from the 25th to the 75th percentiles). The centerline in the box represents the median value and the diamond represents the mean. The “whiskers” extending from the box denote possible skewness in one or both tails of the distribution of values. In general, the overall means of the H1-rater, H2-rater, and SpeechRater scores across all administrations are close to each other.

Box plots of overall mean scores of the H1-rater, H2-rater and SpeechRater across administrations.
Mean differences and SD ratio
Table 2 shows the results for the means, mean differences, and SD ratio of H1-rater/H2-rater and H1-rater/SpeechRater for each administration. No administration was found to have an H1-rater/H2-rater difference larger than 0.15; also, no administration was found to have an H1-rater/SpeechRater difference larger than 0.15. SD ratios for H1-rater/H2-rater range from 0.99 to 1.11, whereas SD ratios for H1-rater/SpeechRater range from 1.47 to 1.66, indicating that SpeechRater scores have a much smaller variance (about 0.24) than that of human raters (about 0.57). The closer the SD ratio is to 1.00, the more similar the variances of the two sets of scores.
Means and standard deviations for human rater scores and SpeechRater scores for each administration.
Note: The numbers in the column under the heading “Count” refer to data for the H1-rater and data for SpeechRater; the H2-rater is only 10% of the H1 data.
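As a minimal sketch of the two statistics reported in Table 2, the following Python function computes a mean difference and an SD ratio for two lists of scores. The direction of the difference (H1-rater minus the comparison scores), the use of the sample standard deviation, and the function name are assumptions made for illustration; the text only fixes that the ratio places the H1-rater SD in the numerator.

```python
from statistics import mean, stdev


def mean_diff_and_sd_ratio(h1_scores, other_scores):
    """Return (mean difference, SD ratio) for H1-rater vs. comparison scores.

    Assumptions: difference = mean(H1) - mean(other);
    ratio = SD(H1) / SD(other), using sample standard deviations.
    """
    diff = mean(h1_scores) - mean(other_scores)
    ratio = stdev(h1_scores) / stdev(other_scores)
    return diff, ratio
```

An SD ratio above 1.00 means the comparison scores are less spread out than the H1-rater scores. With the approximate variances quoted above, the implied ratio is sqrt(0.57 / 0.24) ≈ 1.54, which lies inside the reported range of 1.47 to 1.66.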
Standardized mean difference
Table 3 shows the results for the standardized mean differences between H1-rater/H2-rater and H1-rater/SpeechRater at the item level. Three items were found to have large effect sizes between the H1-rater and H2-rater; a total of 16 items were found to have large effect sizes (> 0.15) between the H1-rater and SpeechRater. For some items, SpeechRater gave higher scores than human raters while for other items, human raters gave higher scores.
Comparison of human rater and SpeechRaterSM scores for each speaking item using standardized mean difference test.
Note: The numbers in the column under the heading “N”, refer to data for the H1-rater and data for SpeechRaterSM. The H2-Rater is only 10% of the H1 data.
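The text does not spell out the formula for the standardized mean difference. One common definition, assumed here purely for illustration, divides the difference in means by the square root of the average of the two variances. The function name and the sign convention (H1-rater minus comparison) are likewise assumptions.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(h1_scores, other_scores):
    """Difference in means divided by a pooled standard deviation.

    Assumed definition: (mean(H1) - mean(other)) /
    sqrt((var(H1) + var(other)) / 2), with sample variances.
    """
    pooled_sd = sqrt((variance(h1_scores) + variance(other_scores)) / 2)
    return (mean(h1_scores) - mean(other_scores)) / pooled_sd
```

Under this definition an item would be flagged when the absolute value of the result exceeds 0.15.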
Correlations
Table 4 shows the results for the correlations between the H1-rater and H2-rater, and between the H1-rater and SpeechRater, at the administration level. The correlations between the H1-rater and SpeechRater scores are slightly higher than those between the H1-rater and H2-rater, ranging from 0.69 to 0.81, with all of them being higher than 0.70 except for one administration (0.69). Note that H2-rater scores were available only for a randomly selected sample of about 10% of the H1-rater/SpeechRater cases.
Correlations between human rater and SpeechRater scores and two human rater scores.
Note: H2 is only 10% of the H1 data (1100). For the aggregate correlation, each student’s mean scores of the six items for the H1-rater, H2-rater (if available) and SpeechRater are used for the calculation.
Kappa
The “quadratic weighted kappa” statistics for the H1-rater/SpeechRater were calculated (see Table 5). The quadratic weighted kappa by administration ranged from 0.70 to 0.80, which met the 0.70 requirement.
Quadratic weighted kappa for the H1-rater and SpeechRater.
Quadratic weighted kappa is the mean of the six speaking items’ kappa values for each administration.
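A minimal sketch of the kappa computation follows, using the standard definition of quadratic weighted kappa on the 1–4 scale. The truncation bounds (clipping raw scores to the range 1 to 4) and the half-up rounding rule are assumptions; the text states only that raw SpeechRater scores were truncated and rounded before being compared with integer human scores.

```python
from math import floor


def to_integer_score(raw, low=1, high=4):
    """Clip a raw score to [low, high] and round half up (assumed rule)."""
    clipped = min(max(raw, low), high)
    return int(floor(clipped + 0.5))


def quadratic_weighted_kappa(a, b, low=1, high=4):
    """Quadratic weighted kappa for two lists of integer scores."""
    k = high - low + 1
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - low][y - low] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement weight
            num += w * observed[i][j]          # observed disagreement
            den += w * row[i] * col[j] / n     # disagreement expected by chance
    return 1.0 - num / den
```

The per-administration value described in the table note would then be the mean of the six item-level kappas. The function is undefined when chance disagreement is zero, for instance when both raters assign a single identical score to every response.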
Monitoring rating consistency using Shewhart charts
In order to ensure a high quality of scoring, a Shewhart chart was used as a technique for ensuring quality in a measurement process. A Shewhart chart is one of the statistical process control (SPC) techniques that have been widely used in industrial settings as tools for maintaining product quality (Vani, 1995). SPC charts have also been applied to educational measurement (Meijer, 2002; Omar, 2010; Veerkamp & Glas, 2000). This study illustrates how we applied the Shewhart SPC chart to monitor human rater and SpeechRater performance differences over time, an application which differs from all the other applications of Shewhart control charts in assessments (Gao, 2009; Lee & von Davier, 2013; Omar, 2010).
A Shewhart control chart has four elements (the last two elements are combined): points that represent a statistic (mean) of the measurement of a quality characteristic in samples taken from the process at different times, a center line that is drawn at the value of the mean of the statistic, and upper and lower control limits that indicate the threshold beyond which the process output is considered out of control. Control limits are computed from the process standard deviation.
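The upper control limits (UCL) and lower control limits (LCL) are
$$\mathrm{UCL} = \bar{\bar{x}} + k\sigma, \qquad \mathrm{LCL} = \bar{\bar{x}} - k\sigma,$$
where $\bar{\bar{x}}$ is the baseline (mean of means), $\sigma$ is the process standard deviation, and k is the distance of the control limits from the baseline, expressed in terms of the standard deviation unit.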
In this study, control limits (lines) are drawn at six sigma (SDs) from the center line and represent the threshold where points above (or below) those lines are considered outliers. The Shewhart chart was plotted based on the rating difference, computed as the H1-rater score minus the SpeechRater score. If the human rating was more stringent, the rating difference was negative, and vice versa. The goal of perfect rating agreement between the human and SpeechRater is represented by a zero difference. Thus, a natural control difference of zero (μd = 0) is used as the target for this SPC chart. Larger mean difference effects would be observed if human raters are either more or less lenient than the SpeechRater scoring.
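The control-limit logic can be sketched in Python as below. The sketch assumes a target center line of zero, as described above, and leaves both the multiplier k and the standard deviation for each plotted point as inputs supplied by the caller, since the text does not give the exact estimator. All names are hypothetical.

```python
def control_limits(center, sigma, k):
    """Return (LCL, UCL) placed k standard deviations around the center line."""
    return center - k * sigma, center + k * sigma


def out_of_control(mean_diffs, sigmas, k, center=0.0):
    """Label each point that falls outside its own control limits.

    mean_diffs: dict mapping a label (e.g. an administration) to its mean
                rating difference.
    sigmas:     dict mapping the same labels to the standard deviation used
                for that point's limits (assumption: supplied by the caller).
    """
    flagged = {}
    for label, diff in mean_diffs.items():
        lcl, ucl = control_limits(center, sigmas[label], k)
        if diff > ucl:
            flagged[label] = "above UCL"
        elif diff < lcl:
            flagged[label] = "below LCL"
    return flagged
```

Passing a separate standard deviation per label allows the limits to vary from point to point, which is one way to obtain limits that fluctuate across groups or administrations.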
Shewhart control charts were used for displaying the rating mean differences between the SpeechRater and the H1-raters based on a combination of 6 items in each of the 10 administrations. Shewhart control charts were also used to identify potential native language groups whose mean differences between human raters and SpeechRater are much larger than others across the 10 administrations.
We used the Shewhart control charts to identify potential “outlier” rating differences between the H1-rater and SpeechRater across 10 administrations. Figure 2 provides a Shewhart control chart of the overall mean rating differences by the H1-rater versus SpeechRater for the six speaking items by each administration. It is not difficult to see that some mean rating differences are above the six-sigma upper control limit while some others are below the six-sigma lower control limit. For example, in Figure 2, there were three administrations (Admin_E, Admin_G, Admin_I) where the mean rating differences were above the six-sigma upper control limit, and there were two administrations (Admin_F, Admin_J) where the mean rating differences were below the lower six-sigma control limit. These “out-of-control” mean differences indicate that the human raters rate differently from SpeechRater to some extent. One interesting finding is that those administrations that had mean differences close to or beyond the upper control limit are mainly from Western countries whereas those close to or below the lower limit are mainly from Eastern countries.

A Shewhart control chart for the difference between the H1-rater and SpeechRater in the mean score of six items by administration.
Research question 2
To address the second research question, we investigated whether SpeechRaterSM could be used to check human raters’ severity and leniency using the following human rater bias analyses.
Human rater bias
In this study, we examined the data further to better understand the trends of human raters by comparing their scores with SpeechRater scores. To identify those human raters who were stricter or more lenient than other raters, SpeechRater scores were used to help identify the “outlier” human raters by looking at the differences between the ratings of the human raters and their corresponding SpeechRater scores.
The mean of these differences was labeled “Bias”:
$$\mathrm{Bias} = \frac{1}{N_p} \sum_{i=1}^{N_p} D_i,$$
where $D_i$ is the difference between the H1-rater score and the SpeechRater score for item i, and $N_p$ is the total number of items scored by a given rater. Raters for whom the bias was equal to or above an absolute value of 0.30 were labeled as potential “outlier” raters.
To address research question 2, the bias between the ratings of each human rater and the corresponding SpeechRater scores was calculated. Raters who scored a reasonable number of spoken responses (N > 20) and whose absolute bias value was greater than or equal to 0.30 (an arbitrary cut-off value from our previous e-rater study; see Wang & von Davier, 2014) were listed as potential outlier raters in the study. For the first speaking items (the first items in each administration were combined across administrations; see Table 6), we found 3 harsh and 22 lenient H1-raters out of 210 who scored at least 20 items when comparing their ratings to SpeechRater scores. For the second items across administrations, we found 4 harsh and 14 lenient human raters out of a total of 211 raters; for the third items, we found 4 harsh and only 1 lenient human rater out of a total of 204 raters; for the fourth items, we found 6 harsh and 2 lenient human raters out of 213 raters across administrations; for the fifth items, we found 4 harsh and 4 lenient raters out of 206 raters; for the sixth items, we found 8 harsh and 5 lenient raters out of 203 raters.
List of human rater outliers versus SpeechRaterSM for each item.
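The rater bias screening can be sketched as follows. The sketch assumes that each difference is the H1-rater score minus the SpeechRater score, so that a positive bias marks a rater as lenient relative to SpeechRater and a negative bias as harsh. The record layout, the function name, and the inclusive minimum-count rule are hypothetical; the text gives the minimum both as N > 20 and as at least 20 items, so it is left as a parameter.

```python
from collections import defaultdict


def outlier_raters(records, min_n=20, cutoff=0.30):
    """Return {rater_id: (label, bias)} for raters at or beyond the cut-off.

    records: iterable of (rater_id, h1_score, speechrater_score) tuples.
    Assumption: bias = mean(h1_score - speechrater_score) per rater.
    """
    diffs = defaultdict(list)
    for rater_id, h1_score, sr_score in records:
        diffs[rater_id].append(h1_score - sr_score)

    flagged = {}
    for rater_id, d in diffs.items():
        if len(d) < min_n:
            continue  # too few responses for a stable estimate
        bias = sum(d) / len(d)
        if abs(bias) >= cutoff:
            flagged[rater_id] = ("lenient" if bias > 0 else "harsh", bias)
    return flagged
```

Because SpeechRater scores serve only as a common reference point here, a flagged rater is a candidate for review rather than a confirmed outlier.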
Research question 3
To address the third research question, we examined whether SpeechRaterSM scores of different language groups differ in the same way as the scores assigned by human raters using the Shewhart chart and kappa statistics.
Shewhart chart
The performance differences between the H1-rater and SpeechRater in the mean scores of the 6 items across all the 10 administrations were plotted for the 16 largest language groups with more than 100 students within each group (see Figure 3). Fluctuating control limits were also plotted based on the different standard deviation of each operational administration. Outliers of mean score differences were observed for the Chinese language group (with higher SpeechRater scores) and for the German, English, French, Italian, Portuguese, and Russian language groups (with lower SpeechRater scores). It seems that SpeechRater ratings were close to human raters’ ratings for some of the native language groups, but not for the others (e.g., Chinese, German, English, French, Italian, Portuguese, and Russian). For German, the SpeechRater scores seem to be much lower than the corresponding human rater scores when they are compared to the other language groups.

A Shewhart control chart for the difference between the H1-rater and SpeechRater in the mean score of six items by native language group.
Kappa
The mean “quadratic weighted kappa” of H1-rater/SpeechRater was calculated for each of the largest 16 native language groups. Overall, the Japanese group had the highest kappa value (0.86) and the German group had the lowest kappa value (0.39).
Mean quadratic weighted kappa of H1-rater/SpeechRater for each language group.
N is the mean sample size per administration for each language group. Quadratic weighted kappa is the mean of 10 administrations’ kappa values for each language group.
Summary
The quality control of human rater and SpeechRater scores is essential. This paper summarizes some important considerations and illustrates the application of specific statistical and graphical procedures that can help monitor both human and SpeechRater performances. Below, we summarize the findings from our research and suggest recommendations for operational practice.
As for the first research question, we found that although the overall correlations between the H1-rater scores and SpeechRater scores are similar to the correlations between the H1-rater and H2-rater scores, at the item level, we observed more differences between scores from the H1-rater and SpeechRater than between scores from the H1-rater and H2-rater. Items with large discrepancies between the H1-rater and SpeechRater need further investigation. For example, we can use the H2-rater to identify whether a SpeechRater score is an outlier if the H1-rater and H2-rater scores are close to each other, or we can use the H2-rater to identify whether the H1-rater score is an outlier if its SpeechRater score and H2-rater score are close to each other.
At the operational administration level, the H1-rater and SpeechRater agree well if we look at the means and correlations, indicating that the scores from the two scoring methods are comparable. The results from the kappa statistics (quadratic weighted kappa) met our expectation (> 0.70). Additionally, it appears that SpeechRater produced slightly higher scores than human raters for Eastern countries and gave slightly lower scores for Western countries, whereas human raters (both the H1-rater and H2-rater) did not have such a pattern.
Some statistics and charts such as means, SDs, box plots, correlations, and standardized mean differences are very effective in detecting outlier speaking items, human raters, and SpeechRater scores. We recommend they be used jointly while monitoring human rater and SpeechRater scores at the examinee level, item level, and test score level throughout the operational cycle.
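As an illustration of one of these statistics, a standardized mean difference between automated and human scores can be computed as below. The pooled standard deviation used in the denominator is a common convention and is assumed here; the study may have used a different one.

```python
from math import sqrt
from statistics import mean, stdev

def standardized_mean_difference(auto_scores, human_scores):
    """(mean automated - mean human) divided by the pooled SD.

    The pooling formula (root mean of the two variances) is an
    assumption made for this sketch.
    """
    pooled_sd = sqrt((stdev(auto_scores) ** 2 + stdev(human_scores) ** 2) / 2)
    return (mean(auto_scores) - mean(human_scores)) / pooled_sd
```

A positive value indicates that the automated scores are higher on average. Computing the statistic separately per item or per administration gives values that can be compared against a program-specific flagging threshold.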
As for research question 2, we identified a few lenient and a few harsh human raters by comparing their scores with SpeechRater scores, which serve as a useful reference because automated scoring is more consistent than human scoring. Such bias analyses can help identify human raters who tend to give harsh or lenient scores. We identified more lenient human raters for items 1 and 2 (the two independent items in the test) than for items 3–6 (the integrated items). This might indicate that human raters give more credit to content aspects of these independent items than SpeechRater does, since its features relate only to the other two construct dimensions, that is, delivery and language use. Due to the limited sample sizes, the results of these rater bias analyses need to be replicated and confirmed by using larger samples (e.g., N > 100).
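A rater bias screen of this kind might look like the following sketch. It is not the procedure used in the study: the record layout, the `threshold` on the mean difference, and the function name are hypothetical, and `min_n` simply encodes the sample size caveat mentioned above.

```python
from collections import defaultdict
from statistics import mean

def classify_raters(records, threshold=0.5, min_n=100):
    """Label raters by their mean (human - SpeechRater) score difference.

    records: iterable of (rater_id, human_score, speechrater_score).
    threshold and min_n are illustrative assumptions.
    """
    diffs = defaultdict(list)
    for rater_id, human, auto in records:
        diffs[rater_id].append(human - auto)
    labels = {}
    for rater_id, d in diffs.items():
        if len(d) < min_n:
            labels[rater_id] = "insufficient data"  # too few responses to judge
        elif mean(d) > threshold:
            labels[rater_id] = "lenient"  # scores well above SpeechRater
        elif mean(d) < -threshold:
            labels[rater_id] = "harsh"    # scores well below SpeechRater
        else:
            labels[rater_id] = "typical"
    return labels
```

Running the screen separately for independent and integrated items would keep differences that stem from item type from being read as rater effects.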
As for the third research question, when we looked at the H1-rater and SpeechRater score mean differences for each of the 16 largest native language groups, we found the largest differences in the Chinese and German groups. SpeechRater mean scores were higher than the H1-rater scores for Chinese and lower for German. One reason for these discrepancies could be the difference in overall mean scores for different native languages. For example, the H1 mean score for German test takers in our data is 3.31, whereas the overall mean score of all test takers is 2.64. In addition, since the scoring model used by SpeechRater only evaluates a subset of the speaking construct, and in particular does not use features related to content and discourse, it may show a positive bias for first language groups whose responses are weak on topic development and a negative bias for groups whose responses are strong on it.
Overall, scores for Japanese and Korean speakers have higher kappa values than for other native language groups such as German. More investigations need to be conducted to identify the major reasons why some native language groups have lower kappa values.
In general, we further found that some statistics and charts used in the study, such as standardized mean differences, correlations, kappa values, box plots, and Shewhart charts, are very effective in detecting the differences between human and automated scoring at both the item and administration levels.
Discussion and limitations
Generally speaking, this study addresses several important research questions that need to be investigated before an automated speech scoring system such as SpeechRater can be implemented for the operational scoring of English language assessment speaking items. There are some systematic patterns in the score differences between human raters and SpeechRater. Figure 2 seems to suggest a mixture in the mean differences across administrations, which is likely a result of regional effects (East or West). SpeechRater tended to produce comparable mean scores across operational administrations regardless of the region, while the human raters’ scores varied substantially across administrations in different regions. These patterns may relate to how the administrations were scored. Thus, for future analyses, it would be worth exploring the region of individual administrations when introducing the data and examining the results by region. How different item types (independent vs. integrated) are related to the differences and correlations between scores is also worth exploring.
When the standard deviations of SpeechRater were compared with those of human raters, the former were significantly lower, likely related to the central tendency of the multiple regression scoring approach used by SpeechRater. Also, the score distributions of human raters and SpeechRater appear to be somewhat different. A generalizability study may need to be conducted to investigate different sources of error between the two scoring modes. Furthermore, SpeechRater scoring seems to exhibit a small bias against several specific language groups (e.g., German), and in favor of the Chinese group. Reasons could include the following: (1) effects of the central tendency of the linear regression scoring model used by SpeechRater, which reduces the number of predicted scores at the extremes of the scoring scale (having a stronger effect on languages with a higher overall mean score, such as German); (2) differences in considered construct aspects between SpeechRater and human raters in conjunction with the extent to which speakers with different first languages exhibit differential speaking profiles; (3) effects related to the score and language distribution in the training sample (e.g., bias towards certain L1 score distributions); and (4) effects related to differential functioning of certain SpeechRater features for different native languages.
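The central tendency effect named in reason (1) can be illustrated with a toy example. The sketch below fits a one-feature least-squares line to made-up data (the `feature` and `human` values are invented for illustration and are not SpeechRater features or study data) and shows that the predicted scores have the same mean as the human scores but a smaller spread.

```python
from statistics import mean, pstdev

def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented toy data: one feature value and one human score per response.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
human = [1, 2, 1, 3, 2, 4, 3, 4]

slope, intercept = fit_line(feature, human)
predicted = [intercept + slope * f for f in feature]

# The means match, but the predicted scores are pulled toward the mean,
# so their SD is smaller and the extremes of the scale are reached less often.
print(mean(human), mean(predicted))
print(pstdev(human), pstdev(predicted))
```

Because predictions are compressed toward the overall mean, a group whose human scores lie well above that mean would tend to be under-predicted, which is consistent with the direction of the German result discussed above.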
When using automated scoring operationally, language bias can be reduced by using a contributory scoring approach, where both automated and human scores contribute jointly to a final item score.
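One simple form of contributory scoring is a weighted average of the two scores, sketched below. The weighting scheme and the default `human_weight` are assumptions made for illustration; operational programs may combine scores differently.

```python
def contributory_score(human, automated, human_weight=0.5):
    """Combine a human and an automated item score into one final score.

    human_weight is a hypothetical parameter; the remaining weight goes
    to the automated score.
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human + (1.0 - human_weight) * automated
```

Under this scheme, a systematic offset in the automated score carries into the final score scaled by (1 - human_weight), so with equal weights a group-level bias in the automated score is halved.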
Biased human raters were also identified by using SpeechRater scores. Since human raters’ scores are based on three areas (delivery, language use, and topic development), whereas SpeechRater scores are based only on delivery and language use, differences between the two types of scores would probably exist even if there were no rater effects.
Another limitation of automated speech scoring is related to imperfect automatic speech recognition: the system used in the current study has a word error rate of around 30%, which is quite substantial. ASR systems for native speech, in contrast, can obtain much smaller word error rates (much less than 5%), but recognizing non-native speech from a large variety of first language backgrounds and speaking proficiency levels remains a challenge for current ASR technology. Still, in recent years, the availability of new algorithms (in particular, Deep Neural Networks), more powerful hardware (Graphics Processing Units), in combination with substantially larger data sets for ASR training, has led to a noticeable reduction in word error rate; for example, Cheng, Chen, and Metallinou (2015) report a word error rate of 23% on open-ended non-native speech.
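For reference, word error rate is the word-level edit distance between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words, giving a word error rate of about 33%.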
In this study, we only applied one statistical approach to identify biased human raters owing to the lack of a human rater study related to our speech data. We therefore recommend additional studies and analyses for future work in order to confirm the findings related to the human rater bias in this study.
In this context, we also want to stress that the automated speech scoring engine SpeechRater is still evolving in terms of improved speech recognition, and that additional features, covering a more extended subset of the Speaking construct, are currently under development and are planned to be used in future scoring models. Moreover, we are also exploring alternative approaches to the standard linear regression scoring model currently used by SpeechRater, which may lead to improvements in its scoring performance.
Finally, as a result of this study, we recommend using multiple procedures (statistics and plots) to identify outlier items, administrations, and human raters. Note that these recommendations are in some sense preliminary, given that our operational experience with SpeechRater is relatively limited. This study responds to the need for statistical analyses to provide a consistent and standardized approach for monitoring the quality of automated speech scoring over time and across programs (Bejar, 2011; Bridgeman, 2013; Ramineni et al., 2012; Wang & von Davier, 2014). Since studies of effective quality control procedures for automated scoring, including SpeechRater, are lacking, we believe that the statistical analyses used in this study offer a useful way to identify issues in both human and automated scoring. Graphics (e.g., Shewhart control charts and box plots) should be an integral part of the quality control system because they present informative multivariate views of the data and can highlight general trends and correspondence with expectations.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
