Abstract
This paper compares two alternative scoring methods – multiple regression and classification trees – for an automated speech scoring system used in a practice environment. The two methods were evaluated on two criteria: construct representation and empirical performance in predicting human scores. The empirical performance of the two scoring models is reported in Zechner, Higgins, Xi, & Williamson (2009), which discusses the development of the entire automated speech scoring system; the current paper shifts the focus to the comparison of the two scoring methods, elaborating both technical and substantive considerations and providing a reasoned argument for the trade-off between them. We concluded that a multiple regression model with expert weights was superior to the classification tree model. In addition to comparing the relative performance of the two models, we also evaluated the adequacy of the regression model for the intended use. In particular, the construct representation of the model was sufficiently broad to justify its use in a low-stakes application. The correlation of the model-predicted total test scores with human scores (r = 0.7) was also deemed acceptable for practice purposes.
The first automated system for scoring pronunciation of read speech was developed more than two decades ago (Bernstein, Weintraub, Cohen, & Murveit, 1989; Bernstein, Cohen, Murveit, Rtischev, & Weintraub, 1990); however, it was not until recently that applications of automated speech scoring have been expanded to include systems for scoring highly constrained non-native speech such as read-aloud and sentence repeats (Bernstein, 1999). Highly predictable speech can be recognized with good accuracy and scored in a straightforward manner. Automated scoring for spontaneous non-native speech such as that elicited by the Test of English as a Foreign Language™ Internet-based Test (TOEFL iBT®), however, has presented much greater challenges, owing to the difficulty and complexity in both recognizing and scoring unpredictable speech.
An automated speech scoring system is a complex mechanism that consists of multiple components: the speech recognizer, the feature generation programs, the scoring model, and the user interface (a figure depicting this process is provided in the Supplementary Information Online as Appendix S1). The speech recognizer decodes the input audio files into a recognized string of words. Then the feature generation programs extract scoring features based on output produced by the speech recognizer. The scoring model scores responses to individual tasks based on the scoring features and combines the scores across multiple tasks. The user interface provides the score report and information about score interpretation and use. This article reports on the results of a research and development effort for SpeechRater℠ v1.0, focusing on the third component, the development and evaluation of the automated scoring models. SpeechRater v1.0 is an automated system for scoring the speaking section of the TOEFL Practice Online (TPO), which requires the production of extended speech in a spontaneous manner.
The scoring model is unarguably a crucial piece in an automated scoring engine. Depending on the domain of the application and the intended use, different scoring methods may be best suited for different scoring systems. For example, in cases where being able to understand and explain the scoring logic is a priority, it is preferable to employ a scoring method which is accessible and uses easily interpretable scoring rules. In some cases, multiple scoring methods may be promising candidates (Clauser et al., 1997), which is the case in the present study. Then a comparison of alternative methods will be necessary to select the one that is optimal for the given context and problem.
Automated scoring often relies on complex technological mechanisms and mathematical algorithms to produce a score. Further, the scoring logic and rules utilized by an automated scoring system, although complex, are objective and thus more traceable than the decision process engaged by human raters. There is also more suspicion associated with the scores produced by such systems. Therefore, automated scoring is typically subject to greater scrutiny than human scoring.
Validation efforts for automated scoring have emphasized one or more of the following three approaches (Yang, Buckendahl, Juszkiewicz, & Bhola, 2002): (1) demonstrating the correspondence in scores produced by automated scoring systems and human scorers; (2) understanding the construct represented within the scoring processes that automated scoring systems use; and (3) examining the relationship between automated scores and criterion measures external to the assessment. In this paper, we focus on the first two approaches in our comparison of scoring methods.
Bennett and Bejar (1998) delineate the development of an automated scoring system into two key steps: (1) extracting and implementing relevant features, each of which quantifies a small aspect of the performance; (2) combining them into a score that indicates the overall quality of performance. These two steps can be manipulated to maximize construct representation and to improve the relationships between automated scores and human scores on the same test or on external criterion measures. Bennett and Bejar’s work points to the two critical pieces involved in the development and evaluation of a valid automated scoring model: the scoring features and the scoring model. The construct relevance and coverage of the features, and the way the features are combined to produce scores, provide conceptual evidence to support the validity of the scoring model. These two aspects are what we focus on in our comparison of alternative scoring methods.
Different automated scoring approaches
A variety of methods have been employed in the development of automated scoring models for computer-based assessments, ranging from the more accessible and familiar methods such as linear regression, rule-based expert systems, and classification trees, to the more novel ones including Bayes Nets (Pearl, 1988), and artificial neural networks (ANNs) (Stevens & Najafi, 1993). Williamson, Mislevy, and Bejar (2006) provide a comprehensive collection of papers on these statistical methods used in building automated scoring systems for different domains.
Among these methods, a few have been widely used and proven successful in applications of automated scoring systems. They include rule-based expert systems (Braun, Bejar, & Williamson, 2006; Clauser et al., 1995; Clauser, Margolis, Clyman, & Ross, 1997), multiple regression (Attali & Burstein, 2006; Clauser et al., 1997) and classification trees (Williamson, Bejar, & Hone, 2004).
Rule-based expert systems attempt to model the complex reasoning of experts through simple rules executed by the computer scoring system. They are best suited for domains where the relationships among the scoring features and the construct of interest are well understood and defined. This approach has two particular strengths: (1) The scoring rules attempt to mimic the decision-making processes engaged in by expert human evaluators and are thus perceived as credible; (2) the scoring rules are transparent and can be readily understood. However, the application of this method is constrained by a few limitations. First, the domain of interest has to be sufficiently well defined that scoring rules can be derived to capture in an unambiguous way the relationships between features and evidence in a performance sample and the overall quality of the sample. Second, the features or evidence that are part of the scoring rules have to be manifest in the performance samples and be easy to extract using computer technologies. Third, in a domain where there are many different profiles of performance or many alternative routes to solve a problem, it may not always be possible to derive expert rules that represent all possible profiles or potential solutions.
The expert-based scoring system, although substantively appealing for its transparent scoring rules, is not well suited for our application of speaking assessment due to some of the limitations discussed above. First, it presents immense challenges to implement speech features using speech technologies that indicate all of the key criteria used by human raters to score spontaneous spoken responses. Also, given the complexity of human raters’ decision-making processes involved in scoring speech and our lack of solid understanding of them, it obviously is not trivial to design a rule-based expert scoring system that adequately captures these processes.
The other two commonly used methodologies, multiple regression and classification trees, therefore seem to be more suitable candidates for our application. In selecting the methods to compare for the initial release of SpeechRater, we focused on the more well-known methodologies that yield easily interpretable results. Multiple regression has been found to approximate human judgment quite well in automated scoring engines (Attali & Burstein, 2006; Clauser et al., 1997). As a parametric method, it is also stable and statistically flexible. Classification trees have been used in automated scoring applications such as the licensure of architects (Williamson, Bejar, & Hone, 2004). This method holds out the promise of a model structure which is more congruent with the way in which trained raters make their judgments at the task level. It also has characteristics which are consistent with the nature of the data in this study, which will be discussed below. The decision rules that govern the classification of score classes are also transparent and easily understandable.
The rule-based expert scoring system was ruled out for reasons mentioned above. Other scoring methods, such as Bayes networks, ANNs, or support vector machines (Cortes & Vapnik, 1995), were not pursued owing to their complexity, lack of transparency in model structure, and lack of user-friendly software.
A comparison of alternative automated scoring methods for a computer simulation test of physicians’ patient management skills is reported in Clauser et al. (1997). They compared a regression-based approach in which the weights of the variables were empirically determined, and a rule-based approach in which the scoring logic was articulated by experts first and then translated into computer scoring rules. They found that both approaches yielded good prediction of expert ratings, although the regression-based approach yielded superior performance. They concluded that simple statistical approaches such as regression can be used to approximate expert judgment effectively. The more elegant and complex rule-based approach, although quite appealing, may suffer from a number of limitations, including lack of thorough understanding of experts’ decision-making processes and loss of accuracy in computer’s translation of the rules.
In this study, we chose to compare multiple regression and classification trees. A brief review of classification trees, less well-known to the readership, is provided below.
A brief review of classification trees
Classification trees are hierarchical, sequential classification structures that recursively partition the observations into different classes. At each decision node, the variable that can best classify the cases into distinct classes at a certain value is selected to perform the partition. Then each of the child nodes may be further partitioned into more nodes down the tree, and the process is repeated until a terminal node is reached.
The Classification and Regression Trees (CART) software (Steinberg & Colla, 1997) employs the original methodology developed by Breiman, Friedman, Olshen, & Stone (1984). CART analyses are typically conducted in three steps: (1) tree growing (the maximum tree is grown on the training sample); (2) tree pruning (the maximum tree is pruned to produce a sequence of nested trees and error rates are obtained for each tree using a cross-validation sample or an independent testing sample); and (3) optimal tree selection (the best tree is identified that represents a balance between the error rate and complexity based on the cross-validated or the independent testing sample).
In CART, a few methods are essential for refining the analyses for special circumstances; these include the use of prior probabilities and different splitting rules. When the probabilities of different classes are known in the population but the sample used to train the model is not representative of the population, it may be important to use the priors of the population since this will affect the estimation of error rates. By adjusting the prior probabilities of certain classes that are deemed important, trees can be grown that misclassify them less often. CART also offers a variety of splitting rules such as Gini and Twoing (Steinberg & Colla, 1997). It is usually preferable to explore different splitting rules to select the one that provides the best results for a particular problem.
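To make the partitioning step concrete, the following is a minimal sketch of how a single decision node might choose its split under the Gini rule. It is an illustration only and not the CART software's implementation: the function names are hypothetical, only one numeric feature is searched, and prior probabilities, the Twoing rule, and pruning are all omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Search one numeric feature for the binary split 'value <= threshold'
    that minimizes the size-weighted Gini impurity of the two child nodes.

    Returns (threshold, impurity); threshold is None if no split improves
    on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, gini(labels)
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = (low + high) / 2, impurity
    return best_threshold, best_impurity
```

A full tree would apply this search to every candidate feature at every node and recurse on the resulting child nodes until a stopping condition is met.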
A potential advantage of classification trees is that they do not assume that the underlying relationships between the predictor variables and the predicted classes are linear, follow some specific non-linear link function, or are monotonic in nature. Moreover, a variable that does not discriminate well in the higher-score levels can be used in classifying lower-score levels without impacting the prediction of the higher score levels. These characteristics of classification trees contrast with other classification or prediction techniques, which use all of the important variables in the same way for classifying or predicting each case. Since the distinguishing speech features for different score bands may be different and the relationship between a speech feature and the human speaking score may not be linear, conceptually, classification trees appear to be a suitable technique for this application. In addition, different patterns of strengths and weaknesses in different aspects of speech may lead to the same score band. This is compatible with another feature of classification trees that allows different sets of decision rules to result in the same score band.
The next section provides a brief overview of TPO, for which SpeechRater has been developed.
The TPO assessment
The TPO assessment is designed to help prospective examinees prepare for the TOEFL iBT test by offering them opportunities to practice with retired test forms. TPO users can customize their practice and take the test in a timed or untimed mode. The timed mode attempts to simulate operational testing conditions whereas in the untimed mode, users can progress at their own pace, starting or stopping the test whenever they like and re-recording responses they have submitted if desired.
The TPO Speaking Practice test contains six tasks. The first two are independent tasks that ask candidates to speak about everyday familiar topics. The remaining four tasks are integrated tasks that require test takers to use reading, listening and speaking skills in combination. The TPO Speaking Practice test uses the same scoring rubric and the same pool of raters as the TOEFL iBT Speaking test. The raters issue a holistic score for each response on a score scale of 0–4.
Data used in the study
We used two data sets: responses to the TPO assessment (the TPO data set) and responses from a TOEFL iBT field study (the field study data set). We partitioned the TPO data into a few sets and used them for different purposes. The TPO rec-train set was used for recognizer training, while the TPO sm-train set (scoring model training) was used to evaluate the statistical properties of the features before model building. We also used the TPO data for scoring model training and evaluation. The TPO sm-train set was used for model building, and for evaluation we used both the TPO sm-eval set by itself, and the TPO sm-eval set combined with the TPO rec-train set. Correspondingly we partitioned the field study data set into the sm-train set for model building and the sm-eval set for evaluation. Appendix A summarizes the different data sets and how they were used.
TPO data
The TPO data contained 4162 responses from four test forms, with each form containing six tasks. A single human score was assigned to each response in order to report practice test scores to TPO test takers. A second human score was obtained on each response as part of a special intensive human scoring job. For the purposes of model building and analysis, we used the second set of human scores obtained in the special rating effort, because they were obtained under better scoring conditions. For the first set of scores, because the volumes were low, the practice responses were scored by a pool of 4–5 TOEFL iBT Speaking scoring leaders (more experienced raters) as they came in, rather than in batches, to satisfy the need for quick score reporting. The raters may have jumped between items when there were only a small number of test takers to score. In the special intensive scoring job, by contrast, the scoring was completed within a day by approximately 40 TOEFL iBT Speaking raters, nine scoring leaders, and two senior scoring leaders. Following operational scoring procedures, the raters listened to benchmarks before scoring each item type and scored each item type for approximately two hours before moving on to the next item. The non-adjacent disagreements were adjudicated, and the bulk of the adjudicated scores tended to agree with the second set of ratings obtained as part of the intensive scoring job.
Each response may be assigned a score in the range of 1–4, or 0 if the candidate makes no attempt to answer or produces a few words totally unrelated to the topic. It may also be labeled as ‘technical difficulty’ (TD) when technical issues may have degraded the audio quality so that a fair evaluation is not possible.
The scoring model train (sm-train) and evaluation (sm-eval) sets included responses from the TPO data with human scores in the range 1–4. The TD or 0 responses were treated separately using a filtering model, the performance of which is reported in Xi, Higgins, Zechner, and Williamson (2008). 1
The partitioning of the TPO data was done to avoid overlap between speakers or tasks between the training and evaluation sets. The partitioning was also designed to minimize speaker and task overlap between the recognizer training set and all other sets, although this constraint could not be enforced absolutely. In order to ensure that all data partitions were of sufficient size for their intended purposes, while meeting our other constraints, we were forced to accept some speaker and task overlap between the rec-train partition and other partitions. However, there was no speaker or task overlap between the scoring model training and evaluation sets. The partitioning process was also designed to ensure that the scoring model training and evaluation sets contain the following:
a broad set of tasks;
similar proportions of responses from speakers of particular linguistic backgrounds; and
approximately the same proportion of responses to independent and integrated tasks.
This resulted in the division of the TPO data scored in the range of 1–4 into three sets (Table 1). The exact agreement between human raters was 57.2%, with a weighted kappa of .54, and correlation of .55. The level of human agreement improves as we aggregate scores across tasks (Table 2).
Table 1. Summary statistics of TPO data scored in the range of 1–4
Table 2. Human agreement on aggregated TPO scores (TD and 0 scores omitted) for the TPO Scoring Model Evaluation set + Recognizer Training set
Note: Exact and adjacent agreement rates are typically reported at the task level.
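The requirement that no speaker appear in both the scoring model training and evaluation sets can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the procedure actually used: the function name, the `eval_fraction` parameter, and the random assignment are hypothetical, and the additional constraints described above (no task overlap, balanced linguistic backgrounds, similar proportions of independent and integrated tasks) are not handled.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(responses, eval_fraction=0.3, seed=0):
    """Split responses so that no speaker contributes to both sets.

    responses: list of (response_id, speaker_id) pairs.
    Returns (train_ids, eval_ids). Whole speakers, not individual
    responses, are assigned to one set or the other.
    """
    by_speaker = defaultdict(list)
    for response_id, speaker_id in responses:
        by_speaker[speaker_id].append(response_id)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)  # reproducible random assignment
    n_eval = round(len(speakers) * eval_fraction)

    eval_ids = [r for s in speakers[:n_eval] for r in by_speaker[s]]
    train_ids = [r for s in speakers[n_eval:] for r in by_speaker[s]]
    return train_ids, eval_ids
```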
Field study data
The TOEFL iBT Field Study was undertaken before the official roll-out of the test. We used the field study data in some evaluation runs for a number of reasons. First, the conditions under which the field study data were scored were closer to best practice than they were with the TPO data sets. Second, the partitioning of the field study data allows for better evaluation of the effects of item score aggregation, since the evaluation set contains more examinees with complete sets of six task scores. Finally, evaluation on the field study data provides information about how our model generalizes across populations and audio file formats.
The field study data contained 3502 responses from a single TOEFL iBT Speaking test form. Because we did not train a new recognizer on these data to maximize the recognizer’s performance, we were able to use all the data for the scoring model train (sm-train) and evaluation (sm-eval) sets. These two sets of data were constructed to maximize the number of examinees with six complete tasks in a set so that we could evaluate candidates’ total scores on this section. This constraint prevented us from enforcing a ban on task overlap between the sm-train and sm-eval sets, but did allow us to prevent speaker overlap. Table 3 shows the properties of these two data sets.
Table 3. Summary statistics of TOEFL iBT Field Study data sets
Only about 20% of the responses in these sets were double-scored, so we evaluated the level of human agreement on the subset of the data which had been double-scored. These results are provided in Table 4. (Note that a random sample of responses was double-scored so we did not have enough double-scored responses to provide agreement results for sets of six tasks.)
Table 4. Human agreement on aggregated field study scores
One point to note is that the human–human agreement as indicated by the weighted kappa and the correlation was much higher for the field study data than for the TPO data. This reflects in part the fact that the field study scores were more varied and more evenly distributed across the four score levels than the TPO scores. In contrast, in the TPO data, the scores clustered around 3, with very few at score 1. After adjusting the marginal totals of the TPO sm-eval human–human score matrix to mimic the distribution of marginal totals in the field study data (Haberman, 1979), the weighted kappa estimates on single tasks for the TPO data increased from .55 to .76, and correlations between the two human ratings increased from .56 to .76.
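One standard way to adjust the margins of a cross-classification table while preserving its association structure is iterative proportional fitting. The sketch below illustrates that general technique; that it matches the exact procedure applied here is an assumption, and the function name and arguments are hypothetical. The row and column targets are assumed to sum to the same total.

```python
def adjust_margins(table, row_targets, col_targets, max_iter=200, tol=1e-10):
    """Iterative proportional fitting of a two-way table of counts.

    Rows and columns are rescaled in turn until the row and column totals
    match the targets. The odds ratios of the original table are preserved.
    """
    cells = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        # Rescale each row to its target total.
        for i, row in enumerate(cells):
            total = sum(row)
            if total > 0:
                cells[i] = [x * row_targets[i] / total for x in row]
        # Rescale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            if total > 0:
                for row in cells:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows match as well.
        worst = max(abs(sum(row) - t) for row, t in zip(cells, row_targets))
        if worst < tol:
            break
    return cells
```

Agreement statistics such as weighted kappa can then be recomputed on the adjusted table to see how much of a difference between data sets is attributable to the score distributions alone.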
Development and evaluation of scoring features
Although the focus of this paper is on the comparison of scoring methods, we have included a discussion of the development, evaluation and selection of the features, as it bears on the substantive meaning of the scoring models.
The construct of interest that motivates the scoring features
The TOEFL iBT Speaking test measures test takers’ ability to speak about everyday familiar topics, and summarize, synthesize, and integrate written and audio materials related to campus life and academic course content, and present the information orally in a comprehensible, coherent, and appropriate manner. The scoring rubric represents the construct of speaking. The rubric consists of three major performance categories, Delivery, Language Use, and Topic Development. Raters consider the combined impact of the three categories and assign a holistic score. In assessing Delivery, raters consider the speaker’s pronunciation, intonation, rhythm, rate of speech, and degree of hesitancy. Language Use refers to the diversity, sophistication, and precision of vocabulary, and the range, complexity, and accuracy of grammar. When assessing Topic Development, raters take into account the progression of ideas, the degree of elaboration, the completeness, and, in the case of integrated tasks, the accuracy of the content.
Feature evaluation and selection
The speech recognizer produces various types of output (e.g. word hypotheses, temporal information) that are fed into the feature generation programs to extract the scoring features. We used a speech recognizer that was specifically trained on the TPO data, which achieved an accuracy of 51.4% on an independent evaluation set of 395 responses.
A content advisory committee (CAC) was convened that consisted of five assessment specialists with extensive experience in developing and rating speaking assessments. This committee was charged with the task of reviewing the candidate speech features and the scoring models from a construct perspective.
A total of 29 candidate features were derived based on the scoring rubric discussed above, drawing on the relevant literature (see Xi et al., 2008 for a review of the literature that informs the design of the features) and extensive feedback from the CAC. The CAC made formal evaluations of the construct linkage and coverage of all the features. Using the formal rating form in Appendix B, they first rated independently how well each feature was linked to the rubric and represented the feature class (e.g. fluency) and the dimension (Delivery, Language Use, and Topic Development) and how well the combined set of features represented the rubric. Then, they discussed their ratings and adjusted them, if necessary. We selected a total of 13 features based on the ratings from the CAC (Table 5).
Table 5. Final set of 13 features used in building the scoring models
Notes:
These were correlations before some features were transformed.
Removed due to high correlations with other features.
Then we examined the inter-correlations among these features. If two features were correlated at .90 or higher, one of them was excluded from consideration in building the multiple regression models; the choice of which feature to exclude was based on the features’ linkage to the construct, their conceptual overlap with other existing features, and the strength of their relationships with human scores (Table 5). This process eliminated two features, wpsecutt and silpsec.
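The screening step can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the use of the absolute value of the correlation is an assumption, and the decision about which feature of a flagged pair to drop is left to substantive judgment, as described above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(features, threshold=0.90):
    """features: dict mapping a feature name to its list of values.
    Returns (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = sorted(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:  # absolute value is an assumption
                flagged.append((a, b, r))
    return flagged
```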
The final eleven automated features partially represented four aspects of the scoring rubric: Fluency, Pronunciation, Vocabulary diversity, and Grammatical accuracy. Both Fluency and Pronunciation are related to the Delivery dimension of the rubric; Vocabulary diversity and Grammatical accuracy indicate the Language Use dimension.
Comparison of the two scoring models
This section describes the development and evaluation of the two scoring methods. The evaluation addresses the appropriateness of the scoring models to the construct, as well as the empirical performance of the scoring models in relation to human scores.
Standards for evaluating the scoring models
The scoring models were evaluated on the basis of technical quality and construct representation.
Technical quality
Our evaluation of the technical quality of the scoring models focuses on three aspects:
agreement of automated scores with human scores;
human-automated score agreement in comparison to human–human score agreement; and
mean score differences between automated and human scores.
The primary measures used to assess the automated–human score agreement are correlation and root mean squared error (RMSE). Another primary measure reported is quadratically weighted κ (Cohen, 1968). However, it is computed in terms of rounded scores, and therefore is based on incomplete information about the prediction of multiple regression models, especially at the task score level.
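For reference, quadratically weighted kappa on integer scores can be computed as in the sketch below. The function name and the default 1–4 score range are illustrative choices. Because the statistic is defined on a contingency table of score categories, continuous regression predictions must be rounded before it can be applied, which is the source of the information loss noted above.

```python
def quadratic_weighted_kappa(scores_a, scores_b, min_score=1, max_score=4):
    """Quadratically weighted kappa between two lists of integer scores.

    Disagreements are penalized by the squared distance between categories,
    and observed disagreement is compared with the disagreement expected
    from the two marginal distributions. Undefined (division by zero) if
    both lists contain a single identical score throughout.
    """
    k = max_score - min_score + 1
    n = len(scores_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score][b - min_score] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected
```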
We also reported the exact and ‘exact + adjacent’ agreements with the human scores, as they have been used widely in other work on automated scoring (Rudner & Liang, 2002; Valenti, Nitko, & Cucchiarelli, 2003). For multiple regression models, we reported the correlation and the RMSE using unrounded as well as rounded predictions of scores to have a standard of comparison which applies to both regression-based and classification-based methods.
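The agreement rates and the RMSE on rounded versus unrounded predictions can be sketched as follows. The rounding rule (round half up, then clip to the 1–4 scale) is an assumption made for illustration, and the function names are hypothetical.

```python
from math import floor, sqrt


def round_to_scale(prediction, low=1, high=4):
    """Round half up and clip to the integer score scale (assumed rule)."""
    return max(low, min(high, floor(prediction + 0.5)))


def agreement_rates(predicted, human):
    """Return (exact, exact_plus_adjacent) agreement as proportions.
    Both inputs are lists of integer scores."""
    n = len(human)
    exact = sum(p == h for p, h in zip(predicted, human)) / n
    within_one = sum(abs(p - h) <= 1 for p, h in zip(predicted, human)) / n
    return exact, within_one


def rmse(predicted, human):
    """Root mean squared error; accepts rounded or unrounded predictions."""
    return sqrt(sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(human))
```

Computing `rmse` once on the raw regression output and once on `round_to_scale` of that output gives the two versions described above; only the rounded version is directly comparable with the class predictions of a classification tree.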
We also reported the mean and standard deviation of the scoring models’ predicted scores and of the human scores assigned to each set of responses. These serve to measure any bias or shrinkage which the models might exhibit.
In the development of these evaluation criteria and their application to the scores produced by SpeechRater, we were assisted by a panel of psychometricians with extensive experience in evaluating automated constructed-response scoring technologies. They comprised our Technical Advisory Committee (TAC) and played a comparable role in their oversight of the measurement issues to that played by the CAC regarding construct issues.
Construct representation
In the evaluation of the construct representation of the features, the following factors were considered:
the extent to which the features in the scoring models are linked to and cover the construct; and
the extent to which the way the features are combined to produce scores captures the expected relationships between the features and the speaking scores.
To evaluate the construct representation of each scoring model, the CAC members provided overall ratings of the construct representation of each model using a formal evaluation form (Appendix C). They considered the relevance and coverage of the features present in the model as well as the meaningfulness of the contribution of features to scores.
Evaluation of technical quality
Assumption checking, statistical transformations, and outlier processing
We checked the assumptions of multiple regression. While other assumptions were met, we performed transformations of some non-normal features and processed outliers to satisfy the normality and non-outlier assumptions. We transformed a few features to make the distributions more normal as well as to improve the correlations between the features and the human scores (for a list of the transformed variables and the transformations applied, see Xi et al., 2008). We also examined the feature distributions for outliers and decided that a cutoff value of 4 standard deviations from the mean best isolated the outliers in our training data. We then mapped all feature values outside this range to the maximum or minimum values allowed. Because classification trees are robust to outliers and do not assume normality of the data (Steinberg & Colla, 1997), no variable transformations or outlier processing were performed on the features used by the CART models.
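The outlier treatment described above amounts to clamping each feature at a fixed number of standard deviations from its training-set mean. A minimal sketch follows; the function names are hypothetical, and the use of the population rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def fit_clamp_bounds(train_values, n_sd=4.0):
    """Lower and upper bounds at n_sd standard deviations from the
    training mean. Bounds are estimated on training data only."""
    center = mean(train_values)
    spread = pstdev(train_values)  # population SD; an assumption
    return center - n_sd * spread, center + n_sd * spread


def clamp(values, bounds):
    """Map values outside the bounds to the nearest bound."""
    low, high = bounds
    return [min(high, max(low, v)) for v in values]
```

Applying the bounds estimated on the training set to evaluation data, rather than re-estimating them, keeps the evaluation set from influencing the preprocessing.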
Model building: Multiple regression
The weights of the features in a multiple regression model can be determined empirically based on the data, or set by experts who have an intimate understanding of the relative contributions of the features to the prediction of human holistic judgments. Our aim was to produce a model with high agreement with human raters, but also to structure the model so that its use of our predictive features is consistent with our understanding of the speaking construct. Toward this end, we used fixed feature weights instead of empirically determined weights. 2
We proposed four candidate models with different subsets of the 11 features and different feature weighting schemes. The CAC selected a model with five features (amscore, wpsec, tpsecutt, wdpchk, and lmscore) and with the weights specified in Table 6. They thought that the combination of these five features provided the widest coverage of the speaking construct and that the weights also best captured the expected relationships between aspects of speech indicated by the features and overall speaking proficiency.
Table 6. Features used in the regression model
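The general form of a linear scoring model with fixed rather than estimated weights can be sketched as follows. Only the five feature names are taken from the text. The standardization step, the intercept, the scale factor, and every numeric value passed to the function are hypothetical placeholders, and the sketch should not be read as the parameterization actually used.

```python
FEATURES = ["amscore", "wpsec", "tpsecutt", "wdpchk", "lmscore"]


def standardize(value, mean, sd):
    """Convert a raw feature value to a z-score using training statistics."""
    return (value - mean) / sd


def fixed_weight_score(response, train_stats, weights, intercept, scale):
    """Predict a task score as a weighted sum of standardized features.

    response:    dict mapping feature name to raw value
    train_stats: dict mapping feature name to (mean, sd) from training data
    weights:     dict mapping feature name to an expert-assigned weight
    intercept, scale: hypothetical constants placing the composite on the
                 score scale
    """
    composite = sum(
        weights[name] * standardize(response[name], *train_stats[name])
        for name in FEATURES
    )
    return intercept + scale * composite
```

Because the weights are supplied rather than estimated, the relative contribution of each feature to the score is determined by expert judgment and does not change with the training sample.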
One shortcoming of just using the TPO sm-eval set for evaluation is that there were only 58 candidates with complete sets of six task scores. To address this deficiency, we performed an additional evaluation on the combined data from the TPO sm-eval and rec-train sets, which contained many more (308) complete sets of six task scores.
Model building: CART trees
CART 5.0 was used to build the classification trees. We explored different model configurations (i.e. different combinations of priors and splitting rules). For each combination, a 10-fold cross-validation was conducted. In each set of 10-fold cross-validation, a tree was first grown on the entire sm-train sample. Then the sm-train set was divided into 10 subsets of equal size, stratified on the dependent variable. In each cross-validation run, one subset was used as the testing sample, with the remaining nine subsets as the training sample. This process was repeated 10 times. After the completion of the 10 runs, the error counts from each of the 10 test samples were summed to obtain the overall error count for each subtree in the whole-sample tree sequence (Steinberg & Colla, 1997). Subsequently, the optimal subtree that was a relatively small tree with the highest or near-highest agreement with the human scores (weighted kappa) on the cross-validation sample was identified.
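The stratified cross-validation scheme can be illustrated with the sketch below. It is a simplification and not the CART 5.0 procedure: the error count is accumulated for a single model per fold rather than for every subtree in the pruning sequence, and the function names and the `fit` and `predict` callables are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign response indices to k folds so that each fold has roughly
    the same distribution of human scores as the full sample."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the last class stopped
        # so that fold sizes stay balanced.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def cross_validated_errors(labels, folds, fit, predict):
    """Sum misclassification counts over the held-out folds.

    fit(train_indices) returns a model; predict(model, index) returns a label.
    """
    errors = 0
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(len(labels)) if i not in held_out]
        model = fit(train_indices)
        errors += sum(predict(model, i) != labels[i] for i in test_indices)
    return errors
```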
Results: Multiple regression
The results of the multiple regression model on the TPO sm-eval set are shown in Table 7, broken down into those for scores on a single task, and for aggregated scores on three tasks. The correlation of the predicted unrounded scores with the human scores ranged from 0.46 for single tasks to 0.56 for sets of three tasks. There was also less variation in SpeechRater’s score estimates. The standard deviation of predicted scores (0.32) (Table 7) for single items was considerably lower than that of human scores (0.69) (Table 1).
Performance of the regression model for different sets of scores on TPO Scoring Model Evaluation set
Note: The agreement, correlation, weighted kappa, and RMSE in Tables 7–10 were computed between scoring model predicted scores and human scores.
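For readers who wish to reproduce statistics of this kind, the four measures can be computed as in the sketch below. Two points are assumptions rather than details taken from the study: the kappa uses quadratic weights, and the score levels are taken to be the integers 1 to 4. Agreement and kappa are computed on rounded scores.

```python
import math


def exact_agreement(predicted, human):
    """Proportion of cases where the rounded prediction equals the human score."""
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, human):
    """Root mean squared error between predicted and human scores."""
    squared = sum((p - h) ** 2 for p, h in zip(predicted, human))
    return math.sqrt(squared / len(human))


def quadratic_weighted_kappa(predicted, human, levels=(1, 2, 3, 4)):
    """Kappa with squared-distance weights (an assumed weighting scheme)."""
    n = len(human)
    # Observed disagreement: mean squared distance between paired scores.
    observed = sum((p - h) ** 2 for p, h in zip(predicted, human)) / n
    # Expected disagreement under independence of the two marginal distributions.
    expected = sum(
        (a - b) ** 2 * predicted.count(a) * human.count(b)
        for a in levels
        for b in levels
    ) / (n * n)
    return 1.0 - observed / expected
```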
We also tested our regression model on a data set composed of the TPO sm-eval set combined with the TPO rec-train set. While the TPO rec-train set is not technically a pure unseen evaluation set, because it was used in the training of the speech recognizer, it was not used in the parameterization of the scoring model itself. The results of the regression model for this combined data set (Table 8) were generally in line with the results shown above for the sm-eval set only, but with the addition of results for a complete set of six tasks. For this total raw score summed across six tasks, the correlation between predicted scores and human scores was 0.57. This was somewhat higher than the correlations for smaller sets of aggregated items.
Regression model performance on TPO Scoring Model Evaluation set + Recognizer Training set, and TOEFL iBT Field Study Scoring Model Evaluation set
As a final evaluation, we applied the same regression model to the field study data set to see how well the scoring model performed given a more varied distribution of candidates’ scores. Indeed, as Table 8 shows, the greater variability of the human scores seemed to make a large difference in the model’s performance, as we achieved a correlation of .68 between the scoring model predicted score and the human-assigned score for six tasks combined (additional model performance results for scores aggregated across fewer items are provided in Appendix S2 of the Supplementary Information Online).
Results: CART
In this particular problem, mixed priors (the average of equal priors across score levels and the priors of the scoring model training sample) with the Gini splitting rule yielded the highest weighted kappa with the human scores on the cross-validation sample, among all the combinations of priors and splitting rules. The TPO sm-eval sample cases were dropped down the best tree to obtain the classification rates. The optimal tree using the mixed priors and the Gini splitting rule is presented in Figure 1. This tree shows visually the features that partitioned the TPO sm-train cases into different score classes (terminal nodes) at certain splitting values. Conceptually, the splitting features and values define the boundaries of different score classes.

The optimal tree for classifying different score classes (mixed priors, Gini splitting rule)
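The mixed priors and the Gini criterion can be made concrete with a short sketch. It follows the standard CART formulation, in which priors reweight the class proportions within a node before impurity is computed; it is an illustration under that assumption and not the CART 5.0 implementation.

```python
from collections import Counter


def mixed_priors(training_labels, levels):
    """Average equal priors with the class proportions of the training sample."""
    counts = Counter(training_labels)
    n = len(training_labels)
    equal = 1.0 / len(levels)
    return {level: (equal + counts[level] / n) / 2 for level in levels}


def node_class_probabilities(node_counts, class_totals, priors):
    """Prior-adjusted class probabilities within a node.

    node_counts: number of cases of each class in the node
    class_totals: number of cases of each class in the whole training sample
    """
    joint = {
        c: priors[c] * node_counts.get(c, 0) / class_totals[c] for c in priors
    }
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}


def gini_impurity(class_probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities)
```

With mixed priors, rare score levels receive more weight than their sample proportions alone would give them, but less than under fully equal priors.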
Table 9 shows the performance of the CART model. To facilitate comparison, the same statistics reported for multiple-regression models are also reported for CART models.
Performance of the CART model (mixed priors, Gini splitting rule) on TPO Scoring Model Evaluation set
Additional model evaluations were conducted with the TPO Scoring Model Evaluation set + Recognizer Training set and with the Field Study Scoring Model Evaluation set. When scores were aggregated across six tasks, we saw a correlation of .57 with the human scores (Table 10). A CART tree trained on the Field Study Training set was able to yield a correlation of .70 with human scores for sets of six tasks on the Test set, indicating that there is a potential for better performance with more variation in the scores (additional model performance results for scores aggregated across fewer items are provided in Appendix S3 of the Supplementary Information Online).
CART (mixed priors, Gini splitting rule) model performance on TPO Scoring Model Evaluation set + Recognizer Training set, and TOEFL iBT Field Study Scoring Model Evaluation set
Evaluation of construct representation
The CAC followed two steps to review the substantive meaning of the tree structure. First, the splitting features and the relationships between the splitting features and the score classes were examined. Specifically, the CAC evaluated whether the decision rules that led to the classification of students into different score classes (the terminal nodes) were consistent with their understanding of some typical profiles of students represented at each score level. The second step involved an examination of the splitting values at each decision point. Each splitting value was examined to ensure that the one selected by the tree algorithm corresponded empirically to the CAC’s judgments. To facilitate the second step, borderline cases at each decision point (feature values close to the splitting value), along with some misclassified cases, were reviewed.
Six features were present in the tree shown in Figure 1: amscore, wpsec, wdpchk, silmean, longpmn, and lmscore. The first one, amscore, was a pronunciation feature (Delivery); the second through the fifth were fluency features (Delivery); and the last one, lmscore, was a grammar feature (Language Use). Regarding the construct coverage of the features, the CAC determined that key pronunciation and fluency features were well represented and noted that the only grammatical feature was also present in the tree.
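Selecting borderline cases for expert review at a decision point can be sketched as follows. The `tolerance` parameter and the dictionary layout of each case are hypothetical; the study does not specify how close to the splitting value a case had to be.

```python
def borderline_cases(cases, feature, split_value, tolerance):
    """Return cases whose feature value lies within tolerance of the split.

    cases: list of dicts mapping feature names to values
    """
    return [
        case for case in cases
        if abs(case[feature] - split_value) <= tolerance
    ]
```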
Then the decision rules that led to the terminal nodes were examined by the CAC. The different scoring rules for each score class were deemed to be consistent with some of the typical profiles of students at a particular score level. However, a close examination of the cases in each terminal node would be useful to provide further confirmation that these profiles were typical. If they were determined to be non-typical, we could choose to remove the paths that led to the corresponding terminal nodes.
Although we did not complete this second step, the work gave us some confidence that it was feasible to use expert judgments to examine the appropriateness of the splitting rules.
Table 11 shows the ratings of the substantive meaning of the multiple regression and the CART models by four CAC members, using the rating form in Appendix C. On all three questions, the CAC members showed a preference for the CART model, especially on Questions 2 and 3. They thought that the CART model was a better representation of the relationships between automated features and human scores and that the way the model operates was more consistent with how expert raters decided on a score.
Evaluations of the two candidate models by CAC members
Notes:
One CAC member was not able to participate in this evaluation.
MR stands for multiple regression.
Judgments on the CART model were made before it could be modified for construct considerations whereas the multiple regression model was an expert weight model endorsed by the CAC.
Selecting the final scoring model
In selecting the final scoring model, the substantive meaning of the model, the mathematical and statistical principles underlying each model type, and the empirical results of each model were considered.
Despite the preference for the CART model by the CAC from the perspective of substantive meaning, both the multiple regression model and the CART model were judged to be adequate in representing the rubrics and in capturing the relationships between the automated features and the speaking construct for use in low-stakes practice settings, with ratings of 3 or above on a 5-point scale on Q1 and Q2. Although the models did not include any topic development features such as coherence and content relevance, and they did not cover the full spectrum of language use, they represented the delivery features very well, especially fluency. Fluency is not a knowledge base that speech production draws on. Rather, it is related to multiple knowledge bases in speech production and manifests the degree of automaticity of deeper cognitive processes engaged during speech production. Therefore, fluency tends to be a key aspect of speech that indicates the performance level of a speaker.
The fact that the scoring models used only a subset of the features human raters use did not seem to have a detrimental impact on the model performance, because the features included are key indicators of speaking performance and different aspects of the speaking construct tend to be highly correlated (Xi & Mollaun, 2006).
For the TPO data, the two models performed similarly in terms of correlations with human scores when the scores were summed across six tasks. As summarized in Table 12, for scores summed across six tasks, the correlations between the predicted scores and the human scores were the same (r = .57) for the two models, and the weighted kappa for the total score was higher for the CART model (.55) than for the multiple regression model (.51).
Comparison of the multiple regression model and the CART model (mixed priors, Gini splitting rule) in correlation and kappa between model predicted scores and human scores
We obtained substantially higher correlations with the human scores for both models (.68 for the multiple regression model and .70 for the CART model) on the field-study data. This suggests that the models are likely to perform better on data with more score variability and more evenly distributed score levels.
Regarding technical quality, multiple regression, as a parametric model that assumes a linear relationship between the features and human scores, can generally provide good predictions when linear relationships are observed or approximated in the data. In contrast, CART does not assume that the human scores are particular functions of the features. If it identifies multiple structures in the data, it may partition the data into different regions and attempt to summarize the data structure in each region. This method may work well if strong non-linear relationships are indeed present in the data. As for sample size requirements, multiple regression requires a much smaller training sample than CART to produce a stable solution, because in CART each successive partition leaves less data in each region, so a large sample is needed to yield a stable solution.
For this particular problem, the multiple regression model yielded results fairly similar to those of the CART model in predicting the total test scores. This level of agreement was acceptable for a low-stakes application such as the TPO, given that we obtained much higher correlations on the field-study data, which were more varied and more evenly distributed across the four score levels.
With regard to prediction bias, the multiple regression model produced lower bias in the predicted total test scores, as shown in the means and the associated RMSE estimates. The multiple regression model was able to reproduce the mean of the total human test scores better than the CART model, and the RMSE estimate of the multiple regression model was smaller as well.
While the CART model was preferred by the CAC from a substantive perspective, the multiple regression model was favored by the TAC for its stability, parsimony and algorithmic simplicity. A multiple regression model that uses fixed weights based on expert judgments is also more flexible than a CART model if we need to modify the model to match the parameters of shifting TPO user populations (i.e. changes in means and standard deviations of scores), because updating a multiple regression model would only involve rescaling to the mean and standard deviation of the new population. Using CART for future scoring model updates to accommodate potential population shifts would involve re-training the models on the new data sets, which would often result in completely new tree structures. This would make it harder to maintain the stability of the model structures and hence the meaning of the models. Given the above considerations, a decision was made to use the multiple regression model for SpeechRater v1.0.
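The rescaling step mentioned above can be illustrated with a simple linear transformation. This is one plausible reading of "rescaling to the mean and standard deviation of the new population", offered as an assumption; the function name and its use of the population standard deviation are choices made for the example.

```python
import statistics


def rescale(scores, target_mean, target_sd):
    """Linearly map scores so they have the target mean and standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("scores have no variation and cannot be rescaled")
    return [target_mean + target_sd * (s - mean) / sd for s in scores]
```

The transformation changes only the location and spread of the predicted scores. The feature weights, and therefore the rank order of candidates, stay the same, which is why the meaning of the model is preserved.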
Discussion and conclusion
This study investigated the development and evaluation of two alternative scoring models for SpeechRater v1.0 used for low-stakes practice purposes: multiple regression and classification trees. The two methods were selected for comparison because both had characteristics that were desirable for this particular application. The processes we followed for model development and evaluation represented a principled approach to maximizing two essential qualities: that the model be both substantively meaningful and technically sound. To ensure these two qualities, clear evaluation standards were developed to guide the development and evaluation efforts, and the opinions of content and technical experts were actively sought in this process.
Based on the evaluation results, we concluded that a multiple regression model with feature weights determined by content experts was preferable to the CART model. The multiple regression and CART models yielded comparable correlations with human raters for total test scores. Although some aspects of classification trees are more congruent with the characteristics of the speaking construct, the parsimony and stability of multiple regression as a parametric model made it the preferred choice for the initial version of SpeechRater. The construct representation of the multiple regression model with expert weights was adequate to justify its use in a low-stakes application. Although CART models may bear more resemblance to the way expert human raters are expected to arrive at a score, closer approximation to human rating behavior is not necessarily the most important factor in selecting a model. This is because the ‘mechanical’ process machines typically follow to generate scores is fundamentally quite distinct from the human scoring process, which is more holistic and flexible. A somewhat comparable model with an adequate level of construct representation may be considered sufficient.
The model’s agreement with human raters was not sufficiently high to support high-stakes decisions, but was still suitable for use in low-stakes applications. The correlations between the SpeechRater scores and human scores ranged from .60 to .70 on different datasets. Similar or higher correlations have been reported for other automated speech scoring systems. However, these systems are either used to score simple listening/speaking tasks that do not elicit extended spontaneous speech, or to evaluate pronunciation based on read speech produced by a single L1 group.
For example, the Versant tests contain automatically scored tasks such as read-aloud, sentence repeats, re-arranging word groups, or questions that require answers of mostly one word. Correlations between automated and human scores for the reported subscores ranged from .89 to .93 and the correlation was .97 for the Overall score for a balanced testing sample of 50 test takers (Bernstein & Cheng, 2008). The correlations were based on scores aggregated across a large number of simple tasks that are more likely to be scored accurately and consistently by both human raters and machines than the complex tasks in TOEFL iBT Speaking. Further, the sample size was small (n = 50), suggesting limited representation of test takers with diverse backgrounds. Also, the testing data set was carefully chosen to contain balanced score levels, which typically yields better correlations than skewed data such as those observed in our study.
In the studies on pronunciation evaluation, the systems were evaluated on one L1 group and relatively small samples, whereas our system was evaluated on a large population with very diverse L1s. For example, based on sentences read by 20 Greek L1 speakers of English (four at the native level), Moustroufas and Digalakis (2007) found that the best performing algorithm could predict pronunciation scores that correlated with human scores at .80 at the speaker level. Franco et al. (2000) reported a correlation of .62 with human pronunciation grades at the sentence level for an automatic pronunciation grading system using sentences read by 100 American students speaking in French. In a similar study, they found a correlation of .62 at the sentence level using 206 American English speakers of Spanish (Franco et al., 2010).
Given the limited construct representation and modest prediction accuracy of the multiple regression scoring model, recommendations were made to release the model for use in the TPO with the following conditions: 3
Prediction intervals should be reported to indicate the error bands around the automated scores.
Limitations of this version of SpeechRater should be communicated.
The distinction between the scoring methods used for the TPO and for the TOEFL iBT test should be stressed.
The low-stakes practice use of the scores should be emphasized.
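As an illustration of the first condition, an error band around an automated score could be derived from the RMSE on an evaluation set. The sketch assumes normally distributed prediction errors with constant spread and a hypothetical score range; it is not necessarily the method used for the TPO score reports.

```python
from statistics import NormalDist


def prediction_interval(predicted_score, rmse, coverage=0.95, score_range=(1.0, 4.0)):
    """Approximate interval around a predicted score, clipped to the score range.

    rmse: root mean squared error of the model on an evaluation set
    score_range: assumed lower and upper limits of the reporting scale
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    low = max(score_range[0], predicted_score - z * rmse)
    high = min(score_range[1], predicted_score + z * rmse)
    return low, high
```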
The recognition accuracy of the current system is not high because the speech elicited is extended and unpredictable. SpeechRater relies largely on fluency, pronunciation, grammatical accuracy and lexical diversity features to predict human scores, with no content-related features. Therefore, if a memorized off-topic response were submitted, the system would still be likely to assign a high score based on the features it can currently analyze. However, in a practice environment where TPO users intend to gauge their readiness to take the official test, they may be less inclined to ‘trick’ the system (Xi, Schmidgall, & Wang, 2011), although we realize that the probability of them tricking the computer system would be much higher if the scores were used for high-stakes decisions. In the SpeechRater FAQs on the TPO website, we also encourage TPO users to respond to the tasks seriously as they would during a TOEFL iBT testing situation to be able to obtain a more accurate score from SpeechRater.
The modest results point to the need to improve the prediction accuracy and to expand the construct coverage of the scoring model. 4 The speech features used in this version of SpeechRater did not represent the full spectrum of the criteria that human raters use. An expanded set of speech features may lead to larger differences in the empirical performances of the two scoring methods.
In this study, we used the human scores as the ‘gold standard’ for evaluating the quality of the automated scoring models. While it is generally acknowledged that human scores are prone to error and therefore may not be an ideal basis for evaluation of the quality of automated scoring, they continue to be a highly relevant and convenient criterion. However, given the potential error in human scores, averages of scores from multiple raters or consensus scores from a panel of expert raters, as was done in Clauser et al. (1997), rather than a single set of human scores would be more appropriate criteria for developing and evaluating automated scoring models. This may also improve the performance of the scoring model trained to maximize the prediction of human scores. Further, it is possible to conduct additional evaluations against external criterion measures beyond direct comparison with human scores alone (Powers, Burstein, Chodorow, Fowles, & Kukich, 2002; Weigle, 2010), although we need to establish the validity and reliability of such measures as we would for human scores.
This study used expert judgments as the basis for evaluating the construct representation of the features and the scoring models. Admittedly we are just beginning to understand the complexity of the cognitive processes involved in human judgments. For example, rater verbal protocol research has started to reveal the rich and complex nature of raters’ decision-making processes (Cumming, 1990; Lumley, 2002). More introspective studies are still needed to capture the richness and intricacies of fine human judgments. It may seem contradictory to rely on human judgment in accepting the scoring model from a construct perspective while acknowledging a lack of thorough understanding of human decision-making processes. Nonetheless, by using multiple experts and a principled way to gather expert judgments in this study, we hope to have captured some commonalities in their judgments. Making judgments on automated features and scoring models is an extremely challenging process. Since we started to involve the CAC during the feature conceptualization stage, all of its members have developed a good conceptual understanding of the features. The two scoring methodologies that we compared are also relatively easy to understand conceptually. Therefore, it is reasonable to assume that the experts in this study were quite comfortable with the judgment task. Nonetheless, if resources allow, a larger group of experts should be used to improve the rigor of the process used to elicit expert judgments.
The process we followed in this project can inform similar efforts to develop and compare automated scoring models. In particular, consideration should be given to both the construct representation and the technical quality. In many cases, there may not be a clear winner among alternative models that is superior in both aspects. In such cases, a process similar to the one adopted in this study can be followed to select a method that achieves a balance between the technical and substantive qualities to maximize the overall quality of the model.
Footnotes
Appendix
Rating form for evaluating the construct representation of candidate scoring models
| Question | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 1. How well do the features included in the model represent the TOEFL iBT speaking rubric? | Not well | | Moderately well | | Very well |
| 2. Given the limited number of automated features available, how well does the model capture the relationships between automated features and the speaking construct? | Not well | | Moderately well | | Very well |
| 3. How consistent is the model with the decision-making processes that human raters use to derive a holistic score? | Not well | | Moderately well | | Very well |
