Abstract
Criticism of specific-purpose language (LSP) tests is often directed at their limited ability to represent fully the demands of the target language use situation. Such criticisms extend to the criteria used to assess test performance, which may fail to capture what matters to participants in the domain of interest. This paper reports on the outcomes of an attempt to expand the construct of a specific-purpose test through the inclusion of two new professionally relevant criteria designed to reflect the values of domain experts. The test in question was the speaking component of the Occupational English Test (OET), designed to assess the language proficiency of overseas-trained health professionals applying to practise their profession in Australia.
The criteria were developed from analysis of health professionals’ feedback to trainees, a source that reflected what the professionals value, that is, their indigenous assessment criteria. The criteria considered amenable to inclusion in the OET were as follows: (1) Clinician Engagement with the patient and (2) Management of Interaction in the consultation. Seven OET assessors were trained to apply these professionally relevant criteria at a workshop that introduced a checklist derived from the original data analysis as a tool to aid understanding of the new criteria. Following the workshop, assessors rated a total of 300 pre-recorded OET speaking test performances using both new and existing criteria. Statistical analyses of the ratings indicate the extent to which (a) the judgements of the language-trained assessors using the new criteria were consistent and (b) the new and existing criteria aligned in terms of the construct(s) they represent. Furthermore, feedback from the assessors on the process shows how comfortable and confident they were in representing a health professional perspective.
Keywords
Introduction and literature review
Criticism of specific-purpose language (LSP) tests is often directed at their limited ability to represent fully the demands of the target language use situation. Such criticisms extend to the criteria used to assess test performance, which may fail to capture what matters to participants in the domain of interest. In the context of a language test for healthcare communication, test developers face a challenge in constructing rating scales that are accessible to language-trained assessors, who are responsible for implementing the assessment, at the same time as capturing aspects of communication that are of specific importance in healthcare contexts. This paper reports on the outcomes of an attempt to expand the construct of the speaking component of the Occupational English Test (OET), a specific-purpose language test for health professionals, through the inclusion of two professionally relevant criteria designed to reflect the values of domain experts. These criteria were derived from a detailed analysis of feedback commentary of experienced health professionals on the performance of trainees in interaction with patients. Analysis of data collected in the trial operationalization of the criteria provides information about the capacity of assessors to apply the criteria consistently and about measurement characteristics of the new, domain-related, criteria in relation to those of the existing, language-related, criteria.
The authenticity of a performance test depends on the degree to which it reflects the demands of the target language use situation (Bachman & Palmer, 1996). To this end, test tasks and content should aim to be relevant to and representative of the domain, and further, as stressed by Messick (1995), key features of the domain also should be reflected in the rating scale. In a shift away from theory-based rating scales, work on the development of empirically derived rating scales based on performance data is commensurable with this goal (e.g., Ducasse, 2010; Fulcher, Davidson, & Kemp, 2011; Knoch, 2009; Turner & Upshur, 2002; Upshur & Turner, 1995; see also Fulcher, 2012). The involvement of professionals or subject matter experts from the relevant domain in the construction and/or validation of empirically derived rating scales would suggest greater authenticity for a test in terms of its capacity to represent the domain.
Although domain analysis often informs the development of LSP test content and task type (McNamara, 1997), this process has seldom been extended to the criteria for assessment (Douglas, 2001). Research on the practices of performance assessment in workplace and educational settings has shed light on the otherwise implicit criteria that are ‘indigenous’ (Jacoby, 1998) to a given specialist domain. Examples of such research include Jacoby’s (1998) study of scientists in conference talk rehearsal and Douglas and Myers’ (2000) study of communication skills in veterinary practice. Using the example of travel service encounters, Fulcher, Davidson, and Kemp (2011) discuss how contextual and interactional data, incorporating an indigenous perspective on performance, could be exploited in the development of a domain relevant scoring instrument (i.e., ‘Performance Decision Tree’). However, it is unusual, as yet, for such knowledge of domain experts’ ‘indigenous’ criteria, or of the aspects of communication that are valued in the domain, to be applied by test developers to create a useable tool for specific purpose language assessment. Douglas (2000) cites a small number of examples where the development of LSP assessment criteria has been informed by a domain expert point of view, including the Proficiency test for language teachers: Italian (Elder, 1993a) and the Japanese language test for tour guides (NLLIA Language Testing Research Centre, 1992). The issue remains under-researched perhaps because of the challenges faced by test developers in constructing rating scales that can represent the domain expert’s perspective at the same time as constituting a useable assessment tool within a language-testing context.
One challenge for test developers in constructing rating scales that are informed by criteria indigenous to the domain lies in the contextual specificity of such criteria; the paper by Pill (this issue) considers this question. A second, linked challenge arises where rating scale development has been informed by knowledge of how real-world performances are evaluated by domain experts: the views and values of subject matter experts and language trained experts may not be commensurable. Certainly, it is not unlikely that experts from different backgrounds will ‘see the world differently’, and it appears that language experts compared with experts from other domains have different views of acceptable performance when it comes to judging language, as Jacoby and McNamara (1999) and others have observed, and as has been demonstrated in studies of the influence of professional/disciplinary background on judgements of language ability (e.g., Brown, 1995; Dias, Freedman, Medway, & Paré, 1999; Elder, 1993b, 2001). Elder (1994) discusses the rating of occupation-specific and linguistic factors in simulated classroom performance of foreign language teachers carried out by raters who had the expertise to assess both aspects of performance. She found that the raters dealt differently with the two dimensions of assessment, which, it was therefore recommended, should be reported separately. Feedback from raters intimated how characteristics of speech in test takers’ performances that demonstrated strong ‘teacherliness’ (e.g., a slower rate of delivery and language simplification for the imagined students) were nevertheless perceived as weaker linguistically. Elder suggests that including occupation-specific criteria might at least allow the linguistic assessment to be made without being contaminated by non-linguistic aspects of performance, even if the validity of the occupation-specific criteria is uncertain. 
Incompatibility in terms of assessment outcome between language ability and contextualized interactional skills in three specific-purpose contexts is also discussed in Elder and Brown (1997), who find that ‘[language] teachers and occupational experts appear to operate from different schemata in judging test performance’ (p. 76).
These studies, therefore, raise the question of whether or not language experts can be trained to reflect the perspective of an expert from another domain for the purposes of LSP assessment. In the case of the OET speaking sub-test, which is the focus of this study, since OET assessors are not assumed to have training in a healthcare profession, one might reasonably question the extent to which these assessors can adopt a health professional (HP) perspective when judging language in OET task performance. It is clear that the OET is not a test of clinical skills or content knowledge, yet it remains open to question whether an understanding of the nature of communication in healthcare interactions, while achieved through language, is accessible to people external to the domain.
The aim of this paper is to show how the challenge of operationalizing a domain expert perspective on the quality of communication in workplace settings was taken up in the context of a trial of two new professionally relevant speaking assessment criteria for the OET. The new criteria were derived from domain expert judgement-commentary on communicative performance by HPs/trainee HPs in healthcare settings (see Pill, 2013, and Pill, this issue). The paper addresses two research questions:
(1) Can OET assessors confidently apply the two new professionally relevant speaking criteria and in doing so produce consistent ratings of candidates?
(2) Does measurement of individuals against the two new professionally relevant criteria form a new and separate measurement dimension or is it part of the same measurement dimension that is defined by four existing OET speaking criteria?
The response to the first question draws on qualitative data from assessor feedback on the training they received and their experiences of applying the new criteria, and on quantitative data from Rasch analyses of test scores to establish whether assessors could produce consistent ratings of candidates against the new criteria. The second question is addressed through statistical exploration of the measurement dimensions entailed in the assessment of candidates when using the new criteria compared with the existing criteria. These analyses sought to establish the extent of alignment of the new and existing criteria in terms of the constructs they indicate.
The remainder of the paper is structured as follows. The next section explains the background to the study and introduces the format and purpose of the OET and its speaking sub-test. The paper then documents the approach taken in training OET assessors to understand and be oriented to the new criteria, and describes the process of trialling the new criteria with the trained assessors. The results of the trial are reported and evaluated qualitatively and quantitatively in response to the above research questions. The discussion considers the construct implications, and associated practical implications for score reporting, of augmenting the current assessment scheme to include professionally relevant criteria.
Background
The present study comprised part of a larger project (Elder et al., 2013) which investigated HP perspectives on aspects of communication that are most relevant to effective interactions between HPs and patients. The study findings informed recommendations for changes to the criteria and standards for the OET speaking sub-test with the goal of making the test a better match for the communicative demands of the healthcare workplace.
One outcome of the larger study was the generation of the new professionally relevant speaking assessment criteria trialled in the present study – Clinician Engagement and Management of Interaction. A linked output of the larger study was a checklist of skills/behaviours, performed through language, that were indicative of aspects of successful performance in relation to the new criteria (see below). This checklist was used as a rater training tool in the trial. Both the criteria and checklist were derived from expert-informant commentary on the performance of HPs and trainee HPs in clinical interactions with patients and in this sense can be claimed to reflect a HP perspective on aspects of effective communication in the healthcare domain. It was anticipated that the aspects of communication comprising the criteria and checklist, while professionally relevant, could also be meaningful to and useable by language-trained assessors.
The Occupational English Test
As detailed in Elder (this issue), the OET is used to assess the language readiness of overseas-trained HPs for the communication demands of entry-level healthcare practice in Australia, New Zealand, and Singapore. This paper is concerned with the profession-specific speaking sub-test of the OET, which engages candidates in unscripted role-play simulations of clinical interactions between a HP (candidate) and a patient or carer (interlocutor) performed within a five-minute time allocation. These simulation tasks are prompted by information on stimulus role-cards provided to the participants.
Role-play performances for the OET speaking test are audio-recorded and later rated on the basis of the audio alone by trained language assessors. The existing speaking criteria – labelled Intelligibility, Fluency, Appropriateness of Language, Resources of Grammar and Expression, and Overall Communicative Effectiveness (OCE) – squarely orient the assessment to linguistic features of performance without reference to specific communicative demands of the healthcare domain. The two new criteria, whose implementation is the topic of this study, are Clinician Engagement and Management of Interaction. These criteria give new attention to characteristics of the interaction between the HP candidate and the patient, rather than those of the candidate’s performance alone, and are particular to the context of HP–patient interaction. They replace the existing holistic criterion, OCE, which long-standing feedback from OET assessors had indicated to be vague and difficult to operationalize. OCE was therefore excluded from the trial, leaving four existing criteria, and Clinician Engagement and Management of Interaction were introduced to offer a more explicit rendering of criteria for judging effective performance.
Method
Checklist and criteria
A pilot methodology was employed to train assessors in the systematic application of the new professionally relevant criteria. The training approach centred on use of a training tool comprising a checklist of skills/behaviours indicative of effective communication in HP–patient interactions. The checklist was created as an intermediate stage in the generation of the new criteria from the original data analysis of educators’ commentary on trainees’ performance (see Pill, this issue; for a full description of the process, see also Pill, 2013, pp. 211–214). Specific comments from the data set on a particular theme were taken and summarized to create prototype indicators for key areas of performance. A set of 24 indicators was drawn from the data. Practical constraints in the administration of the OET speaking sub-test meant that some aspects of performance were not amenable to assessment and these were therefore excluded from the checklist. One example is non-verbal communication; the OET assessors work with audio-recordings of test-taker performances, so assessment of this area is not possible at present despite its importance to the educators in their feedback. The 24 indicators were then further consolidated and abstracted into four groups and, finally, into the two new criteria. The groups of indicators relating to the use of language to demonstrate professional manner and patient awareness constitute the criterion Clinician Engagement, and the groups of indicators relating to the use of language for information-gathering and information-giving constitute the criterion Management of Interaction. (Appendix 1 gives the criteria and Appendix 2 is the checklist from Pill, 2013.) As well as having a role in the generation of the criteria, it was noted that the checklist would have an important role in orienting OET assessors to the scope of the new criteria.
Setting and participants
To help the OET assessors understand and apply the new criteria, a one-day workshop was held at the OET Centre, the site for routine rating and assessor training. The participants at the workshop were the OET assessment manager and assessment officer, seven OET-trained and experienced assessors, two workshop leaders and three observers from the project team. Assessors were recruited for the study on the basis of their availability and according to their level of experience as assessors so that the participant cohort was representative of the current OET assessor cohort. Post-workshop assessments, as described below (‘Workshop and rating procedures’), were undertaken independently and off-site by the seven OET assessors. Participating assessors were paid for workshop attendance and for subsequent independent rating of sample performances at the same rates as for assessor activities in routine administrations. Ethics approval for the study was obtained from the relevant Human Ethics Advisory Group at the University of Melbourne.
Materials and procedures
Sample speaking performances
In advance of the workshop, OET staff selected from the test database previously scored performances of 150 test takers. The criteria for selection sought to achieve: an equal number of performances for each of the three professions in the larger study (Elder et al., 2013), that is, medicine, nursing, and physiotherapy; a range of score levels representative of very strong to very weak performances; and a representative mix of candidate gender and first language background. Audio quality of performances was also taken into consideration to ensure that the training did not rely on any samples with suboptimal sound quality. Performances for the professions of medicine and nursing were selected from a single test administration. For physiotherapy, owing to the smaller numbers of candidates in this profession taking the OET, it was necessary to draw on the test database for five separate administrations. The performances were identified by a code only and no candidate details were available to the assessors. As the OET speaking test comprises two role-plays, this selection yielded 300 role-play performance samples. Six of these performance samples were purposively selected for use in the training workshop so as to represent the full data set based on the criteria listed above.
The samples were distributed as ‘take home’ assessments to be undertaken after the workshop, with each assessor completing 85–90 independent assessments. The distribution of samples was organized to ensure that each performance was rated by two different assessors. The variety of possible pairings of assessors across the available performances was maximized; this was done to strengthen the quality of the subsequent data analysis using Facets (see below). The ordering of recordings for each assessor was also varied to avoid any sequencing effect. The post-workshop assessments were completed within 10 days of the workshop.
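The allocation described above can be illustrated with a minimal sketch. It is not the procedure actually used by the project team, which the paper does not specify; the function name `allocate`, the identifiers, and the round-robin strategy are all assumptions made for illustration. The sketch cycles through every possible pair of assessors so that pairings are spread as evenly as possible, then shuffles each assessor's listening order.

```python
import itertools
import random
from collections import defaultdict


def allocate(performance_ids, assessor_ids, seed=0):
    """Assign each performance to two different assessors (hypothetical helper).

    Pairs of assessors are used in rotation so that every possible pairing
    occurs a similar number of times, which links all raters in the design.
    """
    rng = random.Random(seed)  # seeded so the plan is reproducible
    pairs = list(itertools.combinations(assessor_ids, 2))
    rng.shuffle(pairs)  # avoid always favouring the same pairs

    workload = defaultdict(list)
    for perf, pair in zip(performance_ids, itertools.cycle(pairs)):
        for assessor in pair:
            workload[assessor].append(perf)

    # Vary the order of recordings for each assessor independently.
    for assessor in workload:
        rng.shuffle(workload[assessor])
    return dict(workload)


# Illustrative identifiers only; the real performance codes are not given.
performances = [f"P{i:03d}" for i in range(1, 301)]
assessors = [f"A{j}" for j in range(1, 8)]
plan = allocate(performances, assessors)
```

With seven assessors there are 21 possible pairs, so 300 performances give each pair 14 or 15 shared performances and each assessor a load in the mid-to-high 80s, broadly in line with the 85–90 reported.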
Indicator checklist and new criteria
The new professionally relevant criteria were labelled ‘Clinician Engagement’ and ‘Management of Interaction’ (see Appendix 1). The new criteria were formulated intentionally briefly as it was the indicator checklist that would provide the necessary detail for training purposes. For rating purposes, level descriptors (at four different levels) for the new criteria were also provided to assessors in the workshop. These reflected strong, competent, not yet competent, and weak levels of performance.
To orient them to the new criteria, assessors were provided with a checklist of 24 indicators of effective or successful communication with patients reflecting the valued aspects of communication targeted by the new criteria. The checklist comprised the communication skills/behaviours associated with the performance features assessors would most likely need to attend to in applying the new criteria, illustrative examples of language use, and a small glossary of terms (see Appendix 2). The indicators were organized into two main groups of 12, with one group relating to the criterion, ‘Clinician Engagement’ and the other to the criterion, ‘Management of Interaction’. Each group of 12 indicators was further subdivided into two sub-groups reflecting two aspects of each criterion (see Appendix 1): indicators of ‘Clinician Engagement’ were subdivided into seven indicators of ‘professional manner’ and five of ‘patient awareness’; indicators of ‘Management of Interaction’ were subdivided into six indicators for ‘information-gathering’ and six for ‘information-giving’. For each indicator, assessors were instructed to choose from three categories of checkbox: ‘yes’ (skill/behaviour demonstrated), ‘no’ (skill/behaviour not demonstrated), ‘n/a’ (‘not applicable’). The ‘not applicable’ category catered to any role-play scenario that excluded demonstration of certain indicators, either owing to the specific way in which an interaction (being unscripted) may unfold, or to the nature of the given task; for example, some OET role-play tasks place emphasis on the HP (candidate) giving appropriate information to the patient or carer (interlocutor) whereas others focus on eliciting information about the patient’s condition. 
To ensure the checklist was used as intended (i.e., as an orientation tool and not as a scoring rubric), it was pointed out to assessors that checklist responses would not necessarily correspond with scores against the new criteria; for example, a greater number of checks for ‘yes’ would not necessarily imply a higher score.
Workshop and rating procedures
In the workshop, assessors were briefed about the aims of the larger project and the specific aims of this study. Assessors were issued with the indicator checklist and asked to complete the checklist after listening to one recorded sample. Checklist responses were compared among assessors and discussion ensued about interpretations of particular checklist items. Discussion was facilitated by the workshop leaders to ensure participants were given the opportunity to contribute equally and fully, that the order of contribution was varied, and that contributions were elaborated or clarified as necessary. This process was repeated for two more performance samples.
Once they were reasonably familiar with the new criteria via the checklist, assessors were asked to listen to a further performance, to complete the checklist and then to assign ratings against both the existing (linguistic) criteria and the new criteria. For the four existing criteria, performances were rated on six-level scales each with a short level description (as is routine practice), while for the two new criteria, performances were rated on four-level scales. The intention in using different length scales for the existing and new criteria was to limit a potential ‘halo’ effect whereby assessors may be unduly influenced by the linguistic criteria they are more familiar with, and thus award the same rating to the new criteria without giving careful consideration to the qualities of performance as described in the checklist. Checklist responses and ratings were discussed and the process was repeated for a further two OET speaking samples.
Following the workshop, each assessor listened to and rated 85–90 further performances using the checklist and six criteria (two new and four existing). The distribution of samples was organized to ensure that each performance was rated by two different assessors. The post-workshop assessments were undertaken independently. Assessors were asked to use the checklist to help them orient to the criteria for the first several performances (or as many as they felt necessary); they were also asked to refresh this orientation by initially using the checklist when resuming rating after a break.
Assessor feedback form
To gauge assessors’ reactions to the new criteria, a feedback form was administered twice. At the end of the training workshop, the form was used to collect feedback from assessors on their experiences in the workshop; subsequent to the workshop, the form was used to collect feedback on assessors’ experiences with the new criteria on completion of their independent rating of sample performances. The feedback form asked assessors to indicate their level of confidence with applying the new criteria by responding to five-level Likert items. Using open-ended questions, the form also asked assessors to comment on the quality of the workshop training and for their views on how introduction of the new criteria would impact on the scope of the test, test-taker preparation, and assessor training needs.
Data analysis
Assessor feedback
Assessor responses to the Likert items were analysed quantitatively to establish mean levels of confidence in applying the new criteria. Responses to open-ended questions about the quality of the training workshop and the possible impacts of introducing the new criteria to the OET assessment scheme were analysed qualitatively to summarize the key aspects of the assessors’ perspectives.
Score data
The full data set of scores from the post-workshop assessments consisted of double ratings of 300 performances, that is, 85–90 sample performances rated by each of the seven participant assessors using six criteria. Rasch analytic procedures were used on the score data to investigate the quality of the measurement properties of the new criteria and to determine their compatibility with the existing four linguistic criteria. Rasch analysis was carried out using Facets (Linacre, 2008), a software program for multifaceted Rasch measurement, which takes into account the relative leniency/severity of assessor ratings. To establish whether the assessors could produce consistent ratings of candidate performance using the new criteria, two subsets of score data were each subjected to a Facets analysis of fit: the first data subset consisted of the scores on the two new criteria only, and the second consisted of the scores on all six criteria (i.e., the two new and the four existing criteria). The analyses specified three facets: candidate, rater, and item.
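For readers less familiar with the approach, a three-facet Rasch model of the kind described can be written in its usual log-odds form. The paper does not state which variant was specified in Facets; the version below assumes criterion-specific thresholds, since the existing and new criteria were rated on scales of different lengths, and the symbols are the conventional ones rather than notation taken from the study.

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_{ik}
```

Here $P_{nijk}$ is the probability that candidate $n$ is awarded category $k$ by rater $j$ on criterion $i$, $B_n$ is the candidate's ability, $C_j$ the rater's severity, $D_i$ the difficulty of the criterion, and $F_{ik}$ the difficulty of the step from category $k-1$ to category $k$ on that criterion. All parameters are expressed in logits.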
To investigate the measurement properties of the new criteria compared with the existing criteria, the dimensionality of the test data was first considered in terms of the results of the Facets analysis of fit (above) for scores against all six criteria: unidimensionality being a premise of the Facets formula (McNamara, 1991), a ‘good fit’ of the data to the model would point to the measurement of a single performance construct with the use of the six criteria. Counter to this, in order to test for the possibility that the two sets of criteria – new and existing – were each measuring different constructs of performance, the test data were subjected to further and more rigorous tests of unidimensionality. First, tests of linearity (correlation with correction for attenuation) and equality (chi-square) were applied to two sets of candidate measures, that is, the ability estimates (in logits) yielded by Facets analyses, respectively, of scores on the two new criteria only, and scores on the four existing criteria only. Second, a Principal Component Analysis (PCA) of Rasch residuals was carried out for the scores on the new criteria and scores on the existing criteria using data exported from Facets in a format allowing the PCA to be performed in Winsteps (Linacre, 2013), another program for Rasch measurement.
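The correction for attenuation used in the linearity test can be sketched as follows. This is the standard formula (observed correlation divided by the square root of the product of the two reliabilities), not code from the study; the function name and the input values are illustrative assumptions and are not results reported in the paper.

```python
import math


def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for measurement error in both measures.

    r_observed    -- correlation between the two sets of candidate measures
    reliability_x -- reliability of the first set (e.g. new criteria)
    reliability_y -- reliability of the second set (e.g. existing criteria)
    """
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical values chosen only to show the calculation.
corrected = disattenuated_correlation(0.80, 0.90, 0.85)
```

A corrected value close to 1.0 would suggest that the two sets of measures differ mainly because of measurement error, whereas a value clearly below 1.0 would leave open the possibility that they reflect partly different constructs.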
Results
Research Question 1 was addressed by investigating assessor feedback on training and confidence with the new criteria, and Rasch analyses of assessments against the two professionally relevant criteria only, and against all six criteria (i.e., the two new and four existing criteria).
Assessor feedback on training and confidence with the new criteria
The functioning of the new criteria from the perspective of the participant assessors was determined by analysing their feedback on the workshop and the subsequent rating process. Assessors were asked to provide feedback on their confidence in (a) understanding the meaning of the two new criteria, (b) using the checklist, and (c) applying the level descriptors for each of the criteria. The findings, summarized in Table 1, suggest a satisfactory degree of confidence in understanding the new criteria, using the checklist and applying the level descriptors, with mean scores for each Likert item between 3.6 and 4.0 on a scale of 1 to 5 (where 1 indicated ‘not at all’ and 5 indicated ‘extremely’).
Table 1. Self-reported rater confidence; 1 (not at all) to 5 (extremely).
Open-ended comments from assessors revealed considerable enthusiasm for the new criteria. Assessors indicated that these offered more explicit guidance to them than the existing OCE criterion (not used in the study), which they reported to be rather vague. The new criteria were also viewed as encouraging the HP (candidate) to focus on the needs and concerns of the patient or carer (interlocutor) and as acknowledging interactive skills rather than simply focusing on an individual’s language competence. Overall, the feedback on using the checklist was also positive. However, despite the advice given in the workshop, two assessors referred to some difficulty in reconciling their checklist responses with their ratings, since the number of ‘yes,’ ‘no,’ or ‘n/a’ responses for each group of indicators did not directly correspond to the score given for the relevant criterion. Four assessors also commented on the difficulty, at least initially, of attending to the large number of indicators on the checklist and reported spending more time on each performance as a result.
The feedback also indicated that assessors were generally satisfied with the training they had received in the workshop. One assessor’s comment suggested that the sample did not seem to represent performance at the extreme ends of the possible range of candidate ability: ‘It would be interesting to see how the very good, i.e., “6” and poorer “3/4” candidates measure up with the new criteria’ (Assessor 1). Assessors’ suggestions for further support to improve their familiarity with the new criteria, in addition to opportunities for more practice, included the provision of normative feedback by way of information on the scoring behaviour of the other participant assessors.
Facets fit analysis of assessment using two new criteria and all six criteria
Facets fit analyses of two sets of score data were conducted to investigate whether the language assessors were able to use the new criteria consistently: one data set consisted of scores derived from use of the two new criteria only, and the other of scores from use of all six criteria. Table 2 gives the results of the Facets analyses of fit for candidates (performances), raters (assessors) and items (criteria).
Table 2. Fit statistics: items, raters and candidates.
In the analysis of scores on the two new criteria only, the fit statistics show no misfit (defined as Infit mean square indices greater than 1.4 or less than 0.6; ZStd beyond ±2) for raters (Infit mean = .99, SD = .08) or items (Infit mean = .99, SD = .03). These indices show that the assessors can apply the two professionally relevant new criteria consistently and that no criterion items or assessors are misfitting. The number of misfitting candidates (i.e., performances yielding ability measures that were unexpected in terms of the model) is low (n = 6) and within reasonable limits (Pollitt & Hutchinson, 1987).
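The misfit rule applied here can be expressed as a short screening function. The thresholds are those stated in the text; the function name is hypothetical, and because the text does not say whether the mean-square and ZStd conditions must both hold, the sketch assumes by default that either is enough and exposes the stricter reading as an option.

```python
def is_misfitting(infit_mnsq, zstd, low=0.6, high=1.4, z_limit=2.0,
                  require_both=False):
    """Flag an element (rater, criterion or candidate) as misfitting.

    infit_mnsq   -- Infit mean-square statistic for the element
    zstd         -- standardized fit statistic (ZStd) for the element
    require_both -- assumption switch: if True, both conditions must hold;
                    if False (default), either condition flags the element
    """
    mnsq_out_of_range = infit_mnsq < low or infit_mnsq > high
    zstd_out_of_range = abs(zstd) > z_limit
    if require_both:
        return mnsq_out_of_range and zstd_out_of_range
    return mnsq_out_of_range or zstd_out_of_range


# Illustrative values only; not taken from Table 2.
examples = {
    "well-fitting": is_misfitting(0.99, 0.1),
    "noisy": is_misfitting(1.6, 2.5),
    "overfitting": is_misfitting(0.5, 0.0),
}
```

Values above the upper bound indicate more variation than the model predicts, while values below the lower bound indicate ratings that are more predictable than expected.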
In sum, the analysis of assessor feedback on training and confidence with the new criteria, and the Rasch analyses of assessment using the two new criteria and all six (two new and four existing) criteria indicate that the language assessors are able to use the new professionally relevant criteria confidently and consistently.
Research Question 2 sought to establish the extent to which the new criteria, when used thus successfully by trained language assessors, measure part of the same performance construct inherent in the four existing criteria or whether they add a new dimension to the measurement thereby expanding or otherwise augmenting the performance construct of the test.
Testing the dimensionality of measurement with the new and existing criteria
The measurement dimensions of the score data were explored in a series of tests of unidimensionality, an assumption underlying the Rasch model used to evaluate the psychometric properties of the new criteria. The first test of unidimensionality was the level of fit to the Rasch model of the full data set of scores described above (i.e., measurement of candidates against all six criteria). Three further analyses comprising more rigorous challenges to the assumption of unidimensionality were then carried out on two data subsets (i.e., measurements of candidates on the two new criteria and on the four existing criteria): tests of linearity, equality, and a PCA of Rasch residuals.
The Facets fit analysis of scores on all six criteria (existing and new) was carried out by entering the scores into the model as a single data set. Results showed a very high level of model fit, explaining 75.3% of the variance in scores, which would tend to suggest that the two sets of criteria were collectively measuring a single underlying construct of ability, thereby providing evidence in support of the assumption of unidimensionality.
Estimates of candidate ability according to the new and the existing criteria were compared to test this assumption further. To obtain candidate ability estimates derived from each set of criteria, the subsets of score data on the new and existing criteria were subjected to Facets analysis independently. The two sets of candidate measures (in logits) yielded by these separate analyses were then correlated. The resultant coefficient of 0.82, represented graphically in the scatterplot (Figure 1), shows that, although the two sets of criteria were ranking candidates with the degree of similarity that would be expected from a typical pair of judges rating candidates, the two sets of candidate measures were not perfectly correlated.

Figure 1. Relationship between scores (in logits) on the new and the existing criteria.
This result would seem to provide evidence against the unidimensionality of the data. However, it is possible that the reliability of the two sets of candidate measures could account for this result since the person separation reliability statistics for the two data subsets were not the same: .81 according to scores derived from the two new criteria, compared with .93 for scores derived from the four existing criteria (see Table 3).
Table 3. Summary statistics from each Facets analysis.
N = 150 because the set of 300 performances included two performances for 150 candidates, assessed independently. In the analysis, the performances were considered as representing the same candidate.
The lower person separation reliability statistic for scores against the new criteria may indicate that the assessors are less consistent in their scoring patterns on the new criteria compared with the existing criteria. To take account of the reliability of the two sets of ability estimates, the correlation was corrected for attenuation using the following formula (McNamara, 1991, following Henning, 1987, pp. 85–86), where $r_{xy}$ is the observed (uncorrected) correlation and $r_{xx}$ and $r_{yy}$ are the reliability estimates of the two variables:

$$r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$
The correlation corrected for attenuation was 0.94, which again shows that the correlation between candidate measures against the two sets of criteria, although high, is imperfect. The result of this second test of unidimensionality therefore fails to show that the new and the existing criteria are measuring a single construct of speaking ability.
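The corrected value can be checked against the figures reported above. The sketch below assumes the standard correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities) and uses the reported correlation of 0.82 and person separation reliabilities of .81 and .93; the function name is illustrative only.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Divide the observed correlation by the geometric mean of the two reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Values reported in the text: observed r = 0.82; reliabilities .81 (new) and .93 (existing).
corrected = correct_for_attenuation(0.82, 0.81, 0.93)
print(round(corrected, 2))  # prints 0.94
```

Under these assumptions the calculation reproduces the reported value of 0.94, which remains below 1.0 even after allowing for the unreliability of both sets of measures.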
To further test the compatibility of scores derived from the new and the existing criteria, a test of the equality of the ability estimates derived from the two sets of criteria was undertaken in the form of a chi-square test using procedures reported in McNamara (1991, pp. 153–154). In general, the greater the chi-square statistic, the lower the probability of a Type I error in rejecting the null hypothesis that there is no significant difference between the two variables; a small p value provides strong evidence in favour of rejecting the null hypothesis. The resulting value in this case, χ² = 214.5747, df = 149, p < .01, therefore requires the rejection of the hypothesis that there is no statistically significant difference between the ability estimates derived from the new and the existing criteria; the test fails to show that the two sets of criteria are measuring a single construct of speaking ability. Counter to the findings of the Facets fit analysis (in support of the unidimensionality assumption), the results of the correlation and chi-square tests of alignment of the two sets of candidate measures provide evidence of a degree of multidimensionality in the score data derived from the two sets of criteria combined.
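The reported significance level can be checked approximately from the chi-square statistic and its degrees of freedom alone. The sketch below does not reproduce the procedure in McNamara (1991); it simply converts the reported statistic to an upper-tail probability using the Wilson-Hilferty normal approximation, chosen here because it needs only the standard library. The function name is illustrative.

```python
import math


def chi_square_upper_tail(chi2: float, df: int) -> float:
    """Approximate P(X >= chi2) for a chi-square variable via Wilson-Hilferty."""
    k = 2.0 / (9.0 * df)
    # The cube root of chi2/df is approximately normal with mean 1 - k and variance k.
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Statistic and degrees of freedom as reported in the text.
p_value = chi_square_upper_tail(214.5747, 149)
print(p_value < 0.01)  # prints True
```

The approximation agrees with the reported p < .01. An exact tail probability would require a chi-square distribution function from a statistics library.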
In a final test of the unidimensionality assumption, a PCA of Rasch residuals was undertaken using the Winsteps program (Linacre, 2013) to determine whether score variance that was not explained by the single construct assumed in the Rasch model was attributable simply to random error or instead to the presence of other constructs. The results (presented in Figure 2) show that the strongest ‘other’ measure, apart from the primary construct, is related to the two new criteria (numbered ‘1’ and ‘2’ in the top left quadrant). With a cautious interpretation of this finding, given that the contrast (1.7 eigenvalues) is relatively weak (see Raîche, 2005), it seems reasonable to suggest that the new criteria, while overlapping substantially with the existing criteria, are also adding a new dimension to the measurement model or, in other words, extending the scope of the test construct.

Figure 2. Plot of standardized residual contrast (1, 2: new criteria; 3–6: existing criteria).
Taken together, the results of the tests of linearity and equality and the PCA of residuals present a challenge to the results of the first test of unidimensionality (Facets analysis of fit of the score data for all six criteria). They show that candidate ability estimates based on each of the two sets of criteria, new and existing, cannot be assumed to be perfectly correlated or equal and that differences between the estimates are likely to be owing to the presence of a new, or expanded, construct rather than random error of measurement.
Summary of results
Feedback from a sample of OET language assessors, following a training session and subsequent rating practice, indicated a generally positive disposition towards the potential introduction of new professionally relevant criteria (Clinician Engagement and Management of Interaction) to the assessment scheme for the OET speaking sub-test and to the use of an indicator checklist to support their orientation towards the new criteria. The feedback also indicated satisfactory levels of confidence amongst assessors in understanding the new criteria, using the checklist, and applying the new criteria together with four existing linguistic criteria (Intelligibility, Fluency, Appropriateness of Language, and Resources of Grammar and Expression) to rate OET speaking performances consistently. Assessors’ impressions were supported by the results of Facets analyses of fit. Results of statistical analyses of score data suggested that the new criteria are not entirely psychometrically compatible with the existing linguistic criteria. Scores derived from the new criteria, although highly correlated with the existing criteria and showing a good level of fit to the Rasch model, at the same time seem to be adding something new to the assessment, with the results of a PCA of residuals suggesting that score variance not explained by the primary construct is largely attributable to the new criteria.
Discussion
This study has explored the extent to which the challenge of training language experts to represent aspects of an HP perspective could be met in the context of a trial of new professionally relevant assessment criteria for the OET speaking sub-test. On the basis of assessor feedback and the measurement properties observed for the new criteria, it appears that OET assessors were able to use the professionally relevant criteria with confidence and to apply them successfully to make consistent judgements of candidate ability. The incorporation of a domain expert perspective with use of the new professionally relevant criteria represents an extension of the scope of the assessment in line with the aim of the larger project (Elder et al., 2013) of making the test a better match for the communicative demands of the healthcare workplace. At the same time, the study has found evidence for multidimensionality of measurement in this expanded test construct, showing that the new criteria embody a measurement dimension that is separate from, although highly correlated with, that entailed in the use of the existing criteria. The presence of multidimensionality in the test data (associated with the new criteria) has implications for scores and score reporting on the test since, as Adams, McNamara, and Zammit (1998) have argued, to ignore multidimensionality in test score data may have consequences for the classification of candidates (see below). Beyond issues of score reporting, adoption of the new criteria would entail consideration of assessor and interlocutor training, and appropriate uses of the indicator checklist in the longer term. The implications of an expanded speaking construct for the OET as proposed here could be expected to extend to candidate preparation (where a positive washback might be anticipated), and to the education of test users to enable them to adjust their expectations of what the test results might predict about future workplace performance.
The main options for the reporting of candidate scores if the new criteria are to be introduced in a revised OET are a single score which subsumes scores on the current linguistic criteria and scores on the new criteria, or separate scores on the existing and new criteria. To the extent that the new and existing criteria were found to be somewhat psychometrically compatible, there would be some justification in reporting results as a single score. However, in view of the evidence for the presence of multidimensionality in the score data derived from ratings on both sets of criteria, there would seem to be a stronger case for reporting scores on the new and existing criteria separately. On the basis of the results above, it cannot be claimed that the separate estimates of candidate ability, based on the new and the existing criteria, are identical. To ignore the presence of multidimensionality, then, carries its own risk of misclassification of candidates, in addition to that associated with random error of measurement or with less than perfect test reliability (Adams, McNamara, & Zammit, 1998). The extent of misclassification that might be attributable to reporting a single score for each candidate was not explored in this study. However, in addition to the potential for reducing measurement error, further benefits of separately reported scores might be anticipated. Although the consequences of reporting the scores separately might include costs to the test provider (e.g., time in processing and delivering results), there could be benefits to test users in the greater diagnostic power associated with more nuanced information about candidate ability, including the more professionally relevant aspects of performance captured by the new criteria. It is noted, however, that this would need to be balanced with education for test users in interpreting and using the additional information provided.
Alternatively, a different approach could be taken to exploiting the diagnostic potential of the new criteria: instead of counting towards a candidate’s test result, assessment of performance on the new criteria could be used for diagnostic purposes only, as feedback for candidates and/or information for employers that could help them to identify training/support needs. However, the potential for information intended for diagnostic purposes in this context to unfairly disadvantage some test takers would need to be given careful consideration. If not contributing to the grade but still released to test users, it is possible that diagnostic information could be used as a quasi-test result, and without transparency, for selection for employment or other gatekeeping purposes.
Adoption of the new criteria would imply extensive assessor training, given that training of the entire cohort of OET assessors in the application of the new criteria would need to be undertaken. In this regard, the indicator checklist used in this study would appear to be promising as a training tool. The checklist has further potential to be used in candidate test preparation, with positive washback to be expected if test-taker preparation is a better match for the communicative demands of the workplace as a result.
Introducing the new criteria would also have implications for the training of interlocutors. It was observed in feedback comments from participant assessors that, in some interactions, the patient (interlocutor) ‘takes over,’ making it difficult for the HP (candidate) to demonstrate some of the skills/behaviours associated with the new criteria; for example: “The interlocutor also often robs the candidate of any opportunity to elicit the patient’s perspective by providing all the information” (Assessor 5) (see also Woodward-Kron & Elder, this issue). In this sense, the new criteria place a greater burden on interlocutor consistency, which would need to be addressed in training to ensure candidates have fair and equal opportunities to demonstrate the associated skills/behaviours.
Conclusion
This study has shown how language assessors were trained with the aid of a checklist to use new professionally relevant criteria, in conjunction with existing linguistically oriented criteria, to rate speaking performances on a specific-purpose language test for HPs, the OET. Analyses of the score data from these ratings suggest that the new criteria are compatible with the existing criteria at the same time as adding something new to the assessment: developed to reflect domain expert views on aspects of successful communication in healthcare, the new criteria represent an extension of the scope of the test. That the language assessors were trained to use the new criteria successfully (i.e., with sufficient consistency to produce orderly estimates of ability) shows it is possible for the speaking construct, when expanded to include a health professional perspective, to be operationalized in the context of this test. However, an exploration of the dimensionality of test scores derived from the new and existing criteria shows that, in spite of a degree of psychometric compatibility with the existing criteria, the new criteria are measuring a sufficiently different construct of speaking ability to warrant some form of separate reporting of candidate performance against the new criteria.
There are limitations to the study which suggest issues to be addressed in future investigations of whether, and how, the new criteria ought to be introduced into the OET speaking assessment scheme. First, it is noted that the findings of the study may not be generalizable to professions other than medicine, nursing and physiotherapy; that is, the new criteria may not reflect the values of professions other than the three represented by the informants in the larger study (Elder et al., 2013; Pill, this issue). To establish their relevance to the nine other health professions currently served by the OET, the new criteria need to be validated with relevant domain experts. Likewise, the extent to which the new criteria are applicable to role-play tasks across the full range of professions served by the OET needs to be established empirically by trialling the criteria on performances on a wider sample of existing OET role-play tasks. Second, in terms of the performances rated in the study, sampling limitations may have affected the results. As was noted above, one participating assessor was of the view that the range of candidate ability in the sample performances was not representative of the test-taker population.
The findings of the study also suggest areas for further research into assessor cognition in LSP testing. The proposition that criteria informed by domain expert views on the quality of communication in healthcare settings are usable by language-expert raters in a specific-purpose language test has been tested in this study, and findings have been presented indicating this to be possible in a particular testing context. However, the extent to which the language assessors actually adopted an HP perspective remains unclear: do the study results provide evidence that the language assessors were attuned to the same aspects of performance as those valued by HPs, or do they simply show that the OET assessors were trained to be reliable in using the new criteria?
In conclusion, based on the findings of this study, it would seem feasible for language-trained assessors to use professionally relevant criteria in a revised speaking assessment scheme for the OET. The findings invite further research on the relevance of the new criteria to all professions served by the OET. More generally, the attempt described here to extend the scope of a speaking assessment (by adding criteria reflecting the values of domain experts) suggests a promising approach to improving the alignment between specific-purpose language tests and the communicative demands of domains of interest.
Appendix 1: New criteria for speaking assessment (Pill, 2013, pp. 213–214)
Appendix 2: Indicator checklist (Pill, 2013, pp. 268–269)
Acknowledgements
We wish to acknowledge the OET assessors who participated in this study and the OET Centre for hosting the workshop. Particular gratitude is due to Fusae Nojima for extracting and copying the audio samples.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Australian Research Council [Linkage grant number LP0991153].
