Abstract

This review departs from current practice in reviewing tests in that it employs an ‘argument-based approach’ to test validation to guide the review (e.g. Bachman, 2005; Kane, 2006; Mislevy, Steinberg, & Almond, 2002). Specifically, it follows an approach to test development and use that Bachman and Palmer (2010) call the process of ‘assessment justification’. This process focuses on investigating the extent to which the intended uses of a particular test can be justified to stakeholders. The justification process includes two interrelated activities. The first is articulating an assessment use argument (AUA; Bachman, 2003, 2005; Bachman & Palmer, 2010), which makes explicit claims that link test takers’ performance to the consequences of test use. The second activity consists of collecting evidence to support the claims that are articulated in the AUA.
The structure of an AUA consists of a series of statements – claims and warrants – through which the test developer articulates, or makes explicit, her or his intended inferences: from test takers’ performance to their test scores, to interpretations about their ability, to the decisions that are made, to the consequences of test use. These claims and warrants are subject to rebuttals, or counterclaims that the test developer may anticipate and articulate, or that various stakeholders may articulate at any point during the operational life of the test. Rebuttals may occur when evidence that weakens the claim or warrant (what we will call ‘negative evidence’) is observed. Rebuttals may also be anticipated when the observed evidence that has been provided to support the claim or warrant (‘positive evidence’) is judged to be of dubious quality or insufficient to fully support the claim or warrant. As with claims and warrants in an AUA, rebuttals, whether articulated by the test developer or by one or more stakeholders, need to be supported by evidence. In particular, rebuttals articulated by stakeholders call for further investigation by the test developer. Thus, when a rebuttal is articulated by a stakeholder, the test developer will need to evaluate the extent to which this is supported by evidence, and respond by taking appropriate action to mitigate the consequences of the rebuttal on test use or by providing evidence that weakens the rebuttal and thus strengthens the claims and warrants in the AUA.
The process of justification is essentially local: Bachman and Palmer (2010, p. 438) argue that the test developer should articulate and develop a specific AUA for each intended use of the test. Judgments about the extent to which the uses of a given test are justified are local as well (Bachman & Palmer, 2010, p. 438). These judgments may be influenced by a variety of contextual factors including but not limited to the types of stakeholders (e.g. test takers, parents, admission officers, and university professors), the stakes of the test, the priorities and regulations of the local educational institutions, the availability of resources, and the cultural, societal, and educational value systems of the stakeholders. Given the local nature of the justification process, the authors do not intend, nor would it be possible, for this review to pass judgment on any particular use of the test. Rather, this review aims to examine the evidence that is provided by the test developer about the intended uses of the test so that readers and test users will be better informed in making their own judgments.
PTEA: The Pearson Test of English Academic
Test purpose: The primary intended use of the PTEA is to provide score reports that can be used to make admission decisions about a student’s readiness for tertiary academic study in an English-medium environment. Although other uses have been identified, 1 this review will focus on this primary intended use of the PTEA.
Length and administration: The 3-hour test is computer-administered at Pearson VUE test centers.
Scores and scoring procedures: All the test responses, including constructed spoken and written responses, are computer scored. The score reports present an Overall score, Communicative Skills scores (Listening, Reading, Speaking, and Writing), and Enabling Skills scores (Grammar, Oral Fluency, Pronunciation, Spelling, Vocabulary, and Written Discourse). IRT parameters of the test items were calibrated with field test data prior to operational test administrations, using Masters’ partial credit model (Masters, 1982, cited in Pearson, 2008a, p. 29).
Author and publisher: Pearson Education.
Contact information: Available at the PTE official website (http://pearsonpte.com).
Price: The test registration fee varies by country (and region). Detailed information is available at the PTE official website.
Brief description: PTEA is a computer-based English language test that consists of 20 item types in three sections: Speaking and Writing, Reading, and Listening. Detailed information about the test can be found at the PTE official website and in The Official Guide to PTE Academic (2010).
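For readers unfamiliar with the partial credit model named under ‘Scores and scoring procedures’, the following is a minimal sketch of its standard category-probability formula for an item scored 0 to m. The function name, the ability value, and the step difficulties are hypothetical and are not taken from Pearson's calibration.

```python
import math

def pcm_category_probabilities(theta, step_difficulties):
    """Category probabilities under the partial credit model (Masters, 1982).

    theta: test taker ability in logits.
    step_difficulties: step parameters delta_1..delta_m in logits.
    Returns a list of probabilities for scores 0..m.
    """
    # Cumulative sums of (theta - delta_k); the term for score 0 is defined as 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item with four score categories (0-3) and an invented ability value.
probs = pcm_category_probabilities(theta=0.5, step_difficulties=[-1.0, 0.0, 1.2])
```

With a single step difficulty, the formula reduces to the dichotomous Rasch model.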
Review: Investigating the use of the PTEA for making academic admissions decisions
As a starting place for this review, an AUA has been constructed for the primary test use – making admissions decisions. As noted above, the test developer should ideally articulate a specific AUA for each intended use of the test. The AUA constructed in this review contains three elements: (1) claims and warrants that are desired for the intended test use; (2) supporting evidence and rebuttal data that have been identified from documents provided by the test developer; and (3) potential rebuttals, when noted, for further investigation. The potential rebuttals are not meant to be exhaustive, as additional rebuttals may be articulated by various stakeholders or stakeholder groups over the operational life of the test.
As mentioned above, claims and warrants in an AUA need to be supported by backing, or evidence. The evidence presented in the review has been gathered primarily from documents that were provided by Pearson, the developer of PTEA, at the time of the review. We note that this evidence is incomplete for the purpose of evaluating the AUA that underlies this test. The lack of complete evidence may be due, in part, to the fact that the publisher did not collect the evidence with a specific AUA in mind, and in part to the relative newness of the test, so that little research has been conducted on the test outside of the research that went into its development. It is our hope that articulating an AUA for one specific use of this test and pointing to places where evidence is insufficient will provide guidance to both the test developer and users of this test in collecting evidence, both positive and negative, in the future. We hope this will also guide the test developer in articulating specific AUAs for the other uses of the test.
Assessment records 3
Claim: PTEA scores are consistent across different assessment tasks, different aspects of the assessment procedure, and across different groups of test takers.
Consistency
Warrant 1: Procedures for administering the PTEA are followed consistently across different occasions and for all test taker groups.
Evidence: It is explicitly documented that the administration of the PTEA must follow strict procedures. The test developers have reviewed the development process of PTEA with reference to the European Association for Language Testing and Assessment (EALTA) guidelines (EALTA, 2006). The review (De Jong & Zheng, 2011) reports that all test administrators have been trained in using materials available on the Pearson VUE Support Services website and have passed an online open-book test on the materials before they are allowed to administer the test.
Warrant 2: Procedures for producing the PTEA scores are well specified and are adhered to.
Evidence: The criteria and procedures for the human scoring of each of the operational item types are described in a series of internal documents presented to PTEA TAG. These criteria and procedures include the traits that were scored for each item type, as well as the descriptors of each score level.
Psychometric qualities of the test items, such as the item difficulty and discrimination, were thoroughly investigated and used as a basis for revisions of scoring rules used by human raters. Internal reviews of the scoring rules were documented, and problems that were identified appear to have been addressed and resolved properly. In addition, quality control procedures for human scoring practices were proposed in an internal document (Pearson, 2009).
Potential rebuttal(s): No information is available regarding (1) the extent to which the proposal about quality control for human scoring has been accepted and implemented and (2) the effectiveness of the proposed quality control. Since automated scoring, rather than human scoring, is used for the PTEA operational administrations, there could also be questions about the extent to which the scoring criteria and procedures used for human scoring in the field test correspond to the automated scoring in the operational tests.
Warrant 3: The automated scoring algorithms for scoring written and spoken test responses were developed through trialing and comparison with multiple human ratings.
Warrant 4: The automated scoring algorithms for written and spoken test responses were developed through trialing with several different groups of test takers.
Evidence: PTEA employs two automated scoring systems, Intelligent Essay Assessor (IEA) and the Ordinate scoring system (OSS), for its written and spoken sections, respectively. It is documented that both IEA and OSS went through rigorous training and norming stages using large numbers of essays and speech samples that were rated by two trained human raters and by an adjudicator who provided a third score in the case of disagreement.
Potential rebuttal(s): No evidence is available from the review document regarding (1) the frequency of ongoing training and calibration of the automated scoring systems, and (2) the results of the ongoing training and calibration.
Warrant 5: Scores on different tasks in PTEA are internally consistent (internal consistency reliability).
Evidence: Through a series of item analyses, items that were not highly correlated with the total score, or whose model fit was unsatisfactory, were revised or deleted. Reliability estimates were provided for the overall test and for each section score within the PTEA score range of 53 to 79, which was claimed by Pearson to be the most relevant for admission decisions. All the reported reliability estimates within the specified score range were higher than .90.
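The reviewed documents do not state which reliability coefficient Pearson reported. As an illustration only, one common index of internal consistency is Cronbach's alpha, sketched below with invented item scores. The function name and the data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: list of rows, one per test taker; each row lists that
    test taker's scores on the same set of items.
    """
    n_items = len(item_scores[0])
    # Variance of each item across test takers.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(n_items)]
    # Variance of the total scores across test takers.
    total_var = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Invented scores for five test takers on four items (not PTEA data).
scores = [
    [2, 3, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [0, 1, 1, 0],
    [2, 2, 3, 2],
]
alpha = cronbach_alpha(scores)
```

Alpha approaches 1 as the items rank test takers more consistently. The same variance type (population or sample) must be used for items and totals.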
Warrant 6: Ratings of different raters are consistent (inter-rater reliability).
Warrant 7: Different ratings by the same rater are consistent (intra-rater reliability).
Evidence: Because automated scoring is used, inconsistencies between and within human raters are not a source of measurement error.
Potential rebuttal(s): Documentation describing what constitutes potentially problematic responses for the automated scoring algorithms, how they are identified, and what the procedures are for handling these is not available. If human raters are involved in scoring the problematic responses, it is unclear to what extent their inter-rater and intra-rater consistency would be controlled.
Warrant 8: Scores from different forms of PTEA are consistent (equivalent forms reliability).
Evidence: It is documented that different forms of PTEA consist of stratified random samples from a calibrated item pool such that each form has equivalent test information functions across all the sections. Consistency in scores across test forms was observed in results from the field test in 2008.
Potential rebuttal(s): Evidence about equivalence from operational test administrations is not available from the reviewed documents.
Warrant 9: Scores from different administrations of PTEA are consistent (test-retest reliability).
Evidence: Test-retest reliability estimates are available from the second field test involving around 4,000 university students. The overall test-retest reliability estimate was .96, and the test-retest reliability estimate for each section ranged from .87 (Reading) to .94 (Listening).
Potential rebuttal(s): Evidence from operational test administrations is not available from the reviewed documents.
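Test-retest reliability is conventionally estimated as the Pearson correlation between scores from two administrations to the same test takers. The reviewed documents do not describe the exact computation, so the sketch below is an assumption about method, and the scores are invented.

```python
import math

def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    var_x = sum((x - mean_x) ** 2 for x in first)
    var_y = sum((y - mean_y) ** 2 for y in second)
    return cov / math.sqrt(var_x * var_y)

# Invented overall scores for six test takers on two occasions (not PTEA data).
occasion_1 = [52, 61, 47, 70, 58, 66]
occasion_2 = [54, 60, 49, 72, 55, 67]
retest_r = pearson_r(occasion_1, occasion_2)
```

The same function could be applied to operational data to address the rebuttal above, should paired scores from repeat test takers become available.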
Warrant 10: Scores are consistent across different groups of test takers.
Evidence: All the test tasks are computer scored. The scoring algorithm is the same across all groups of test takers.
Interpretations
Claim: PTEA scores can be interpreted as indicators of test takers’ levels of English proficiency for communicating in English-medium tertiary-level academic settings (hereafter referred to as ‘the test construct’). Such interpretations are meaningful with respect to an analysis of authentic English language use in academic settings, impartial to all groups of test takers, generalizable to language use tasks in English-medium tertiary-level academic settings, and relevant to the admission decisions to be made at ‘education institutions, and professional and government organizations that require a standard of academic English for admission purposes’ (Pearson Education Asia, 2010, p. ix). Warrants A1–A7 address the quality of meaningfulness, B1–B5, impartiality, C1–C2, generalizability, and D, relevance, while sufficiency is not claimed for this test. 4
Meaningfulness
Warrant A1: The skills defined in the construct are based on an analysis of authentic English language use in academic settings. The construct definition clearly distinguishes the construct from other related constructs, such as general English proficiency or workplace English.
Evidence: According to the test developer, the design of the test tasks was ‘based on naturalistic examples of English language use in academic settings’ (Pearson, 2006, p. 1). The tasks are intended to involve ‘[c]ommunicative language skills … for reception, production and interaction in the oral and written modes’ (p. 1) because ‘these skills are needed to successfully follow courses and actively participate in tertiary level education where English is [the] language of instruction’ (p. 1).
Potential rebuttal(s): No details are available for checking the correspondence between the specific aspects of ability that have been defined in the test construct and the areas that were deemed important in any needs analysis that was conducted by the test developer.
Warrant A2: The assessment task specifications clearly describe characteristics of the tasks that a test taker will perform in the test.
Evidence: The task characteristics have been clearly and thoroughly specified in the internal test design documents, item writer guidelines, and item writing specifications.
Warrant A3: The procedures for administering the test enable test takers to perform at their highest level to demonstrate their English proficiency for communicating in English-medium tertiary-level academic settings.
Evidence: Test taker feedback from a field test indicates that the test directions are clear and efficient, and the test is easy to navigate. Observed issues in the feedback about the test administration and test environment have reportedly been reviewed and addressed in test revisions (Zheng & De Jong, 2011, p. 21).
Potential rebuttal(s): No documentation is available to demonstrate the extent to which the observed issues have been effectively solved in the operational tests.
Warrant A4: The scoring keys and procedures focus on the specific aspects of ability that are relevant to the English proficiency for communicating in English-medium tertiary-level academic settings.
Evidence: All the tasks are automatically scored. The aspects of performance to be scored are clearly documented in The Official Guide to PTE Academic (2010). The performance of the key and distractors of each multiple-choice item was monitored and controlled statistically to minimize construct-irrelevant variance. A description of and validity information about the automated scoring systems for the speaking and writing tasks are provided in a published document (Pearson, 2011). The reported correlations between trained human raters and the automated scoring systems were .88 and .96 for the writing and speaking total scores, respectively. These correlations are comparable to those among human raters. Security measures were also implemented to increase the scoring engines’ robustness to cheating.
Potential rebuttal(s): No details about the types of measured linguistic variables and the weight of each variable in the scoring algorithm are available for evaluating the correspondence of the scoring models to the construct.
Warrant A5: The test tasks engage test takers’ English proficiency for communicating in English-medium tertiary-level academic settings.
Evidence: The specific aspects of the test construct have been defined in the test specifications. Rigorous measures were implemented to train item writers and to monitor the quality of the items. Items were reviewed regarding their adherence to the test specifications and content requirements. Items that failed to meet the quality review criteria, or were flagged for statistically poor item–total correlations or inappropriate distractors, were removed from the item pool. Items were also removed if the proportion of native speakers who answered the item correctly was lower than that of non-native speakers (Zheng & De Jong, 2011, pp. 21–23). These efforts increase the likelihood that the test tasks could properly address target knowledge and skills in the test construct.
Verbal protocols were collected during a pilot test to investigate the cognitive processes test takers employed. The results helped the test developer to evaluate the test platform and test tasks. Test taker feedback collected from a field test supports the warrant that the test is a good measure of the test construct. The extent to which the test tasks elicit academic vocabulary in test takers’ written production was examined using the percentage of Academic Word List (AWL) tokens and types in the test taker responses. The results show that the items as a group tended to elicit academic vocabulary, but the performance of individual items varied, and not all items were successful in eliciting a minimum of 4% of AWL words – a criterion used in the study as an indicator of authentic academic text (O’Loughlin, 2008, p. 3; Pearson, 2010, p. 2). The research study recommended that the task inputs be monitored as part of the on-going item development process to ‘ascertain the percentage of AWL tokens’ (O’Loughlin, 2008, p. 4).
Potential rebuttal(s): No details are available regarding the extent to which the issues observed from the verbal protocol study have been resolved. Empirical evidence is not available from the operational tests to demonstrate that any issues observed from the field test survey have been adequately resolved. There is also a lack of evidence showing that the suggested approach for monitoring task inputs has been implemented. Finally, no information is available regarding the proportion of academic vocabulary elicited in test takers’ oral responses.
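The AWL token and type percentages discussed under this warrant can be illustrated with a short sketch. It assumes a six-word stand-in for the full Academic Word List, exact matching of word forms rather than word families, and an invented response. We also assume the 4% criterion applies to tokens. None of this reproduces the procedure used in the cited study.

```python
import re

# Tiny illustrative subset standing in for the full Academic Word List (assumption).
AWL_SAMPLE = {"analysis", "data", "research", "significant", "theory", "method"}

def awl_coverage(text, awl=AWL_SAMPLE):
    """Return (token percentage, type percentage) of AWL words in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0, 0.0
    types = set(tokens)
    # Tokens count every occurrence; types count each distinct word once.
    token_pct = 100 * sum(t in awl for t in tokens) / len(tokens)
    type_pct = 100 * len(types & awl) / len(types)
    return token_pct, type_pct

# Invented response text (not a PTEA response).
response = "The research data support the theory, but the data are limited."
token_pct, type_pct = awl_coverage(response)
meets_criterion = token_pct >= 4.0  # 4% threshold described in the text
```

The same calculation could in principle be applied to transcripts of oral responses, for which the last rebuttal above notes that no information is available.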
Warrant A6: Scores on the test can be interpreted as indicators of the test construct.
Evidence: The correlations between PTEA scores and scores on two other tests – TOEFL iBT and the IELTS – ranged from .73 to .95 (Zheng & De Jong, 2011). As both the TOEFL iBT and the IELTS aim to measure communicative English language proficiency, these moderate to high correlations suggest that PTEA scores may support similar interpretations.
Results from dimensionality analyses (Pearson, 2008b; Pae, 2011a) suggest the existence of a dominant general factor. The results support the expectation that one general construct is measured in the test. In addition, results from confirmatory factor analyses (Pae, 2011b) suggest that both receptive and expressive language skills were represented in the measured construct.
Rebuttal(s): Results from a multitrait-multimethod (MTMM) study (Pae, 2011c) suggest a method effect of the multiple-choice question format in measuring reading and listening. The evidence indicates the existence of construct-irrelevant factors.
Warrant A7: Pearson communicates the definition of the construct in non-technical language via the test instructions, examples, and test preparation materials. The construct is also defined in non-technical language in the assessment report for test takers and other stakeholders.
Evidence: The sample test directions clearly describe the expected performance. Examples and detailed explanations of the construct are provided in non-technical language in the official guide, test taker handbook, and tutorials. Practice tests available on the test website may help test takers understand the test construct. In addition, the construct is clearly defined in non-technical language in the score report.
Impartiality
Warrant B1: The test tasks do not include response formats or content that may either favor or disfavor some test takers.
Evidence: Materials are available on the test website to help test takers familiarize themselves with the task format before taking the test. Item bias reviews are designed to effectively prevent content that may introduce bias. Items that showed differential item functioning (DIF) were also identified. Concerns about construct-irrelevant factors such as time management and unfamiliarity with test format that test takers raised in feedback collected from a field trial helped identify potential bias in task formats and content. These concerns were addressed by test revisions (Zheng & De Jong, 2011, p. 21).
Potential rebuttal(s): No empirical evidence is available to demonstrate the extent to which the identified issues have been resolved in the operational tests.
Warrant B2: The test tasks do not include content that may be offensive (topically, culturally, or linguistically inappropriate) to some test takers.
Evidence: Bias and item sensitivity reviews were implemented. Items that were regarded as inappropriate, offensive, or emotionally charged for certain groups of test takers were removed.
Warrant B3: The procedures for producing the test scores are clearly described in terms that are understandable to all test takers.
Evidence: High-level scoring rules and rubrics are clearly described in the official guide (Pearson Education Asia, 2010). General information regarding the development and features of the automated scoring systems is provided in Pearson (2011). Both of these documents are publicly available from the test website.
Potential rebuttal(s): There is no evidence that detailed information about the automated scoring systems, such as the scoring algorithms, the attributes that are implemented to derive scores, the magnitude of training error and testing error, and optimization processes, has been provided or explained to test takers.
Warrant B4: Test takers are treated impartially during all aspects of the administration of the test.
Test takers have equal access to information about the test content and test procedures.
Test takers have equal access to the test, in terms of cost, location, and familiarity with conditions and equipment.
Test takers have equal opportunity to demonstrate their ‘English language proficiency for communication in tertiary level academic settings’.
Evidence: Descriptions of the test content and procedures are presented in test familiarization materials and available from the test website.
The test is given on multiple occasions. A test date can be scheduled online. The test is administered at Pearson VUE test centers, which are available in multiple locations worldwide.
The test is standardized and administered online. Therefore, test takers are tested under similar conditions. Services are also available at the test centers to accommodate test takers with disabilities.
Potential rebuttal(s): Inequality in test access may occur with respect to travel expenses, test takers’ familiarity with the conditions and equipment of the test centers, and their ability to pay the test fee. In addition, no evidence of quality control on the test centers’ conditions and adherence to expected administration standards is available.
Warrant B5: Interpretations of the test construct are equally meaningful across different groups of test takers.
Evidence: Results from multi-group analyses (Pae, 2011b) suggest measurement and structural invariance across gender.
Rebuttal(s): Results from the same study indicate different factor dimensionalities across ability groups, which suggests that the test may measure the test construct in different ways for test takers at different levels of ability. In addition, different score profiles across skill areas were found between Chinese and Indian students (Zheng & Wei, 2011), which calls into question the appropriateness of using the same set of item parameters for the two groups and the meaningfulness of their score interpretations.
Generalizability 5
Warrant C1: The characteristics of the test tasks correspond closely to those of language use tasks in English-medium tertiary-level academic settings.
Evidence: Task characteristics that are supposed to reflect those of the target language use (TLU) domains are clearly and thoroughly specified in the test specifications for each task. Item writers are directed to maximize generalizability to the TLU domains during content development. Authenticity (generalizability, in our terms) is also listed as one of the key item review criteria.
Warrant C2: The criteria and procedures for evaluating the responses to the test tasks correspond closely to those that are typically used in assessing language performance in tasks in English-medium tertiary-level academic settings.
Evidence: The evidence listed under Warrant A4 above may provide some indirect evidence for judging the viability of this warrant.
Potential rebuttal(s): No direct evidence is available regarding the degree of the correspondence in scoring criteria and procedures between the test tasks and tasks in real-life academic settings.
Relevance
Warrant D: Interpretations of the test construct provide information that is relevant to the admission decisions made at ‘education institutions, and professional and government organizations that require a standard of academic English for admission purposes’.
Evidence: The aforementioned evidence under Warrant C1 supports the relevance of the test construct to student success in tertiary-level academic settings. Additionally, expected student performance in the context of university-level education is provided to test users for each test score range. The described correspondence suggests the relevance of the test construct to the target admission decisions.
In situations where CEFR levels are used in the target admission decisions, correspondence between PTEA test scores and CEFR levels may support the relevance of the construct to the admission decisions. The alignment of PTEA score ranges to CEFR levels was established statistically using both a test taker-centered and an item-centered approach (Zheng & De Jong, 2011, p. 32).
Potential rebuttal(s): No evidence is available that scores from PTEA are actually useful to college admission officers in making admission decisions. No technical details are available regarding either the needs analysis or the establishment of the correspondence between PTEA score ranges and expected student performance. Only five items from three item types (Write essay, Describe image, and Re-tell lecture) were used in the test taker-centered approach to aligning the test to CEFR levels. The small number of selected items and the restriction to three item types may limit the generalizability of the linking results.
Decisions
Claim: Admission decisions that are based in part on PTEA scores are sensitive to local educational and societal values, and are equitable for test takers who are granted or denied admission into the tertiary-level programs for which they applied. Warrants A1–A3 below pertain to values-sensitivity, and B1–B2, equitability.
Values sensitivity
Warrant A1: Existing educational and societal values and relevant legal requirements are carefully and critically considered in the admission decisions that are to be made.
Warrant A2: Existing educational and societal values and relevant legal requirements are carefully and critically considered in determining the relative seriousness of false positive and false negative classification errors.
Evidence: None is available in the documents that were reviewed.
Potential rebuttal(s): Admission decisions based on PTEA scores are made by tertiary-level program admissions officers or committees. The review found no direct evidence that these stakeholders consider local values in making decisions based on PTEA scores.
Warrant A3: Cut scores are set so as to minimize the most serious classification errors.
Evidence: Pearson publications (e.g. De Jong & Zheng, 2011) note the availability of a Standard Setting Kit to facilitate the setting of local standards by educational institutions.
Potential rebuttal(s): Content of the Standard Setting Kit is not available for review; therefore, the usefulness of the kit is unclear. It is also unclear how most institutions and organizations set their cut scores based on PTEA.
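To make the logic of Warrants A2 and A3 concrete, the following sketch chooses a cut score that minimizes a weighted sum of false positive and false negative classification errors. It does not represent the Standard Setting Kit, whose content was not available for review. The records, the ‘truly ready’ criterion, the cost weights, and the function names are all hypothetical.

```python
def weighted_error(cut, records, fp_cost, fn_cost):
    """Total cost of classification errors at a given cut score.

    records: (score, truly_ready) pairs, where truly_ready is an
    external judgment of readiness (an assumption of this sketch).
    """
    cost = 0.0
    for score, ready in records:
        admitted = score >= cut
        if admitted and not ready:
            cost += fp_cost  # false positive: admitted but not ready
        elif not admitted and ready:
            cost += fn_cost  # false negative: rejected but ready
    return cost

def best_cut(records, candidate_cuts, fp_cost, fn_cost):
    """Lowest candidate cut score with the minimum weighted error."""
    return min(candidate_cuts,
               key=lambda c: weighted_error(c, records, fp_cost, fn_cost))

# Invented data: (score, truly_ready). Not PTEA results.
records = [(45, False), (50, False), (55, True), (58, False), (62, True), (70, True)]
cuts = range(40, 81)
# An institution that regards false positives as more serious.
strict = best_cut(records, cuts, fp_cost=3.0, fn_cost=1.0)
# An institution that regards false negatives as more serious.
lenient = best_cut(records, cuts, fp_cost=1.0, fn_cost=3.0)
```

Because the relative costs reflect local values, two institutions using the same scores may reasonably arrive at different cut scores, which is why the justification of cut scores is treated here as a local matter.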
Equitability
Warrant B1: The same cut scores and decision rules are used to classify all students who have applied for the same program, and no other considerations are used.
Evidence: None is available in the documents that were reviewed.
Potential rebuttal(s): There is no direct evidence that stakeholders who use PTEA scores use them in a manner that supports this warrant.
Warrant B2: Test takers and other affected stakeholders are fully informed about how decisions will be made and whether decisions are actually made in the way described to them.
Evidence: Stakeholders who make admission decisions based on PTEA scores typically specify minimum acceptable scores on their websites, promotional materials, or applications. Pearson facilitates distribution of this information through its website and promotional materials.
Potential rebuttal(s): There is no direct evidence to demonstrate that the admission decisions are actually made as described.
Consequences
Claim: The consequences of using PTEA and the admission decisions based on PTEA scores are beneficial to test takers and the stakeholders that use the test, including institutions and organizations. Warrants A1–A5 address beneficial consequences of using the assessment, while Warrant B addresses the beneficial consequences of the decisions that are made.
Beneficence
Warrant A1: The consequences of using the assessment that are specific to each stakeholder group will be beneficial.
Evidence: For test takers, this warrant is supported by evidence from PTEA field test participant survey results. In an internal report to the TAG, the survey results were summarized and integrated with comments from interviews and focus groups in London, Sydney, and Beijing. In general, participants reacted positively to the overall test experience, as well as to the test structure, content, instructions, and score report.
For institutions and organizations, Pearson provides information regarding recommended uses of PTEA scores and suggests that the Communicative and Enabling Skills scores included in the score report provide additional information for both admissions and diagnostic purposes.
Potential rebuttal(s): No user feedback about the operational tests is available to supplement the evidence from the reviewed documents. There is also no direct evidence that stakeholders benefit in particular ways from the use of the PTEA for admission and diagnostic purposes.
Warrant A2: The reports of test scores and score-based admission decisions are treated confidentially for each individual test taker.
Evidence: According to Pearson’s website, assessment reports are only accessible through test taker-created online accounts. Test takers use these accounts to schedule test appointments, view assessment reports, and send reports to score users. A privacy policy provided to users during account creation notes that reasonable precautions are taken to protect personal information.
Potential rebuttal(s): It is not clear how reports on admission decisions at different educational institutions are treated.
Warrant A3: The reports of test scores and score-based decisions are presented in ways that are clear and understandable to all stakeholder groups.
Evidence: Pearson provides several documents to aid interpretation of PTEA score reports. Information in these documents is generally directed to test takers, but includes specific guidelines for institutions and organizations as well. In addition, an internal document describing a survey of field trial participants provides some evidence that test takers found the layout of the score report and sub-scores useful. Although the survey results suggest that participants may have been unclear about features of the score report, measures have been taken to address this issue.
Potential rebuttal(s): There is no evidence to demonstrate that the implemented measures have adequately addressed the issue. It is also not clear how reports on admission decisions are presented.
Warrant A4: The reports of test scores and score-based decisions are distributed to stakeholders in a timely manner.
Evidence: Test takers are expected to receive score reports within five working days. Score reports are viewed through a test taker’s online account, which can be utilized to send scores to recognizing institutions and organizations; this process takes up to 48 hours, according to Pearson’s website.
Potential rebuttal(s): It is not clear how reports on admission decisions are distributed at different educational institutions.
Warrant A5: The use of PTEA helps promote good instructional practice and effective learning in language instructional settings, and the use of the assessment is thus beneficial to students, instructors, and language programs.
Evidence: By design, PTEA includes item types that require integrated language skills with the goal of engaging test takers in authentic communicative language use tasks (Zheng & de Jong, 2011, para. 7). Anecdotally, in focus groups with field test participants, some participants explained that they felt they would need to prepare for the test by using authentic classroom material, and by practicing the use of their integrated language skills.
Potential rebuttal(s): Currently, there is no direct evidence that the use of PTEA promotes good instructional practice and effective learning of the test construct.
Warrant B: The consequences of the admission decisions will be beneficial for each group of stakeholders.
Rebuttal: A rebuttal to this warrant is that false positives or false negatives will have detrimental consequences. Specifically, test takers may receive a score higher than they deserve (e.g. through deception) and hence be admitted, or lower than they deserve (e.g. through administration or reporting error), and hence be denied admission.
Evidence: In order to mitigate the effects of false positives and negatives, local institutions can make efforts to minimize their occurrence, or to provide alternatives for test takers when they occur. In addition, if test takers believe that their test scores are incorrect, they may request a re-score or submit an item challenge form.
Summary and discussion
In this study a putative AUA was constructed for the purpose of reviewing the use of the PTEA for making admission decisions at tertiary-level institutions and organizations where English is used for communication. The AUA consists of four claims, along with warrants that are associated with each claim. The evidence that was available to support these claims and warrants has been analyzed and potential rebuttals have been articulated.
Evidence gathered from the review documents is relatively extensive for the claims about the assessment records and score interpretations. In contrast, much less evidence has been found for the claims about decisions and consequences. One reason may be that the claims about assessment records and score interpretations are the only ones that are typically addressed in traditional test reviews. Another reason may be that supporting the claims about assessment records and score interpretations is typically the test developer’s primary responsibility, whereas the responsibility for supporting the claims about decisions and consequences rests primarily with the local decision makers (Bachman & Palmer, 2010, pp. 433–434). To justify the intended test use, decision makers are expected to collect and provide relevant evidence, separately or in collaboration with the test developer. At the same time, we would note that it is the test developer’s responsibility to work with test users/decision makers to determine as clearly as possible what the precise uses of the test will be. While current thinking in the measurement profession is that this is where the test developer’s responsibility ends, we would urge test developers to start holding test users more accountable for the ways in which a given test is used.
Test stakeholders should weigh the positive evidence in support of the warrants in order to inform their own judgment about the justification of the specific test use of interest. Stakeholders also need to evaluate any negative evidence that suggests specific rebuttals, as well as potential rebuttals. Finally, stakeholders need to consider various educational and societal values in evaluating the justifiability of the test and test use in their local contexts.
In this review, the primary use of PTEA, for making admission decisions at tertiary-level institutions and organizations, has been targeted, and positive evidence has been identified from documents provided by the test developer for most of the articulated warrants. Rebuttal evidence has rarely been observed in the provided documents; however, potential rebuttals have frequently been noted.
It is suggested that the test developer conduct or sponsor studies to address the potential rebuttals we have noted. Evidence collected from such studies will facilitate more informed judgments about the credibility of the corresponding warrants and claims, and the degree to which the intended test use is justified. For example, neither internal consistency nor form equivalence reliability estimates can be found for the operational tests. Updated estimates could assure test users that the consistency of the operational tests has been monitored and is at least comparable to that of the field test. This evidence may strengthen the relevant warrants and alleviate stakeholders’ concerns about the intended test use.
Another example pertains to the development and application of the automated scoring systems, which are directly addressed in six warrants under Consistency and in one warrant each under Meaningfulness, Impartiality, and Generalizability. We have noted the lack of supporting evidence for all these warrants, except for the warrant about consistency in scoring across different groups of test takers. The major aspects for which evidence is needed include the following:
ongoing monitoring and enhancement of the scoring systems;
identification and handling of student responses that are challenging to the systems;
details about the scoring algorithms and the types and weights of attributes that are used in automated scoring; and
the extent to which the scoring model and procedure are representative of the evaluation criteria and processes that are used in human scoring or in the real-life target language use domain(s).
It is understandable that the test developer may be reluctant to release some of the information to stakeholders of the test because of concerns about intellectual property. In addition, providing stakeholders with information on exactly which features of speech are evaluated by the system and how they are weighted may compromise test security and validity (e.g. potentially creating negative washback). However, with the expanded use of automated scoring for high-stakes tests, it is increasingly important for some stakeholders to understand precisely how constructs have been operationalized in a scoring system. For these stakeholders, the benefits of a well-designed computer scoring system – speed, efficiency, and consistency – need to be weighed against the possibility of a narrower construct, which represents a potential rebuttal to several of the warrants that were articulated earlier in the review.
To support the intended test use, it would also be helpful for the test developer to examine the negative evidence that has been identified in the review and take measures to resolve the identified issues or mitigate the potential negative impact of unresolved issues. For instance, improving the quality of multiple-choice items or using a different test format may reduce the impact of test method on the intended score interpretations and therefore increase the credibility of Warrant A6 under Interpretations.
Acknowledgements
The authors would like to thank Ying Zheng and John de Jong of Pearson Education for providing relevant information about the research and development that has gone into the PTEA, along with reports on ongoing research in support of the intended uses of this test. The authors would also like to thank Liying Cheng for her assistance in facilitating communications between the reviewers and the publisher.
