Abstract

In this closing commentary, we offer our appraisal of the Special Issue, our views on where the discipline of language testing stands with respect to Open Science (OS), and what we should be aiming for going forward.
Behind the curve
As a subdiscipline of applied linguistics, language testing was once ahead of the curve in terms of advancing research methods, in particular, the application of quantitative analyses and, of course, advocating for a rigorous approach to measurement and validation. For example, advanced statistical analyses like confirmatory factor analysis (Bachman & Palmer, 1982) and Rasch modeling (McNamara, 1990) became popular in language testing research well before seeing more widespread application in other areas of applied linguistics, and language testers offered critique and advice to other subfields regarding construct measurement and validation (e.g., Douglas, 2001). This earned the field a positive reputation, with other subdisciplines looking to language testing for guidance. A recent exemplification of language testing’s methodological contributions to other areas of applied linguistics is the Routledge Handbook of Second Language Acquisition and Language Assessment (Winke & Brunfaut, 2021).
When it comes to OS, however, language testing appears to be behind the curve in terms of interest and current engagement in related practices. Some support for this claim comes from behind the scenes of this Special Issue. When we put out the call for proposals, we received a total of 18 submissions. In comparison with recent previous special issues, this number is on the smaller end: slightly more than the 15 for 2023’s Special Issue on accommodations in language testing (a critically important topic, but nonetheless one that perhaps still has a niche status in the broader field; Taylor & Banerjee, 2023) but far fewer than the 104 for 2022’s Special Issue on local language tests (Dimova et al., 2022). We found this somewhat surprising given that our call was topically rather open: We sought empirical papers which either substantively addressed OS directly or addressed any topic within the scope of Language Testing as long as the study also exemplified engagement with OS practices (e.g., by sharing data, materials, and/or analysis code). Notably, many of the submitted proposals and eventually published articles came from researchers who primarily work in areas outside of language testing. Among proposals and published articles from those who primarily do work in language testing, authors clearly skewed to the early-career end of the spectrum. Senior (and even mid-career) language testing scholars were largely absent among proposal authors. This suggests a somewhat limited interest in, or at least engagement with, OS in the field of language testing and assessment.
Looking more broadly at practices in language testing research, Liu et al. (2024) found that only about 14% of empirical studies in the flagship journals of the discipline (Language Testing and Language Assessment Quarterly) since 2008 were published Open Access (OA) and a mere 1% provided open data (8/707) or analysis code (9/707). Even without any comparison with other disciplines, 1% is a very low proportion of articles with open data or code, and 14% (roughly 1 out of 7) is clearly toward the lower end of what is possible for OA publishing. More encouragingly, Liu et al. (2024) noted that open materials are somewhat common (in about 35% of empirical studies) and found upticks in the proportion of open manuscripts, code, and data in recent years. Among non-empirical studies, OA publishing is slightly better off, in part due to openly available test reviews in Language Testing, which serve a valuable function of getting professional expertise to non-expert test users (Harding & Winke, 2021). And it must be mentioned that language testing does have some Diamond OA journals, notably Studies in Language Assessment, a journal that appears to be developing in both prestige and number of articles published. Still, it seems quite clear (to us, at least) that language testing has some catching up to do. Applied linguistics subfields such as computer-assisted language learning (Language Learning & Technology, ReCALL) and SLA/language teaching (Studies in Second Language Learning and Teaching) have prominent, well-regarded fully OA journals as well as high rates of OA publication in hybrid OA journals (e.g., 29% in Bilingualism: Language and Cognition in 2020–2022, according to Journal Citation Reports; we note this journal has announced it will become fully OA in 2025). Furthermore, journals such as Applied Psycholinguistics now have stringent policies requiring open materials, data, and code (or a reasonable explanation as to why sharing is not possible).
There are plausible explanations for this limited interest and engagement. As Winke (2024), whose Viewpoint is generally positive toward OS in language testing, noted, there are confidentiality and security concerns with OS practices, such as data and materials sharing, that must be taken seriously, a sentiment echoed by several of the responses. Financial interests of the language testing industry (see Isbell & Kim, 2023) may also, at times, have negative impacts on the adoption of OS practices, as it may be seen as “enough,” or even preferable, to simply have a published validation report without the additional transparency that OS brings. There also seems to be some sentiment among language testers that language testing research does not need OS to be rigorous and that OS might even encourage sloppy research. In this Special Issue, Chapelle and Ockey’s (2024) response to Winke offers the strongest critique of OS along these lines. To elaborate, they make the arguments that both the overall quality of language testing research and the quality of training provided to language testing researchers are high and that researchers lacking detailed contextual knowledge may be prone to misusing data they did not themselves collect (i.e., open data shared by others), with the latter point echoed by Gebril and Bali (2024). These are reasonable points, and we personally do not disagree regarding the generally high quality of language testing research and training for graduate students. Ultimately, however, we find the hesitation unfortunate given the potential for OS to offer benefits above and beyond what the field has already achieved.
Regardless of obstacles and hesitations, it still stands that with respect to OS, language testing finds itself not as a leader in applied linguistics, but as a subfield with much to learn and some catching up to do. It is now our turn to learn from methodological innovations and insights developed or adopted in other disciplines within applied linguistics, much as those other subfields have looked to language testers over the years for guidance on quantitative analysis and validation.
Why we should catch up
Acknowledging that the nature of language testing data may not allow for sharing as readily as in some other fields (see Chapelle & Ockey, 2024; Isbell & Kim, 2023; Winke, 2024), we believe that OS is worthwhile and that this Special Issue highlights many reasons why. In what follows, we discuss reusability, scrutability, and trustworthiness throughout the research process, benefits for researcher training, and benefits for industry testing organizations.
Reusable materials to increase efficiency
First, the contributions to this Special Issue have generated several openly accessible resources which can be used by other researchers and test developers. Pan and Marsden’s (2024) Tests of Aptitude for Language Learning can be readily used by others whose research involves Mandarin first language (L1) participants. While not a language test in a conventional sense, aptitude tests have a long history of use in the field and are particularly of interest for validating other tests, as aptitude can substantially explain variation in test scores and/or changes in test scores over time. While the proprietary Modern Language Aptitude Test has been widely used in research, Pan and Marsden have provided researchers with a more accessible and affordable tool. Dudley et al.’s (2024) Context-Aligned Two Thousand Test of French vocabulary was designed to assess the receptive vocabulary knowledge of students in U.K. secondary schools. This test is likely to be useful to those researching French teaching and learning in the United Kingdom, and the transparency offered by the open materials may prove useful to others seeking to design similar tests (perhaps for a different language or a different French curriculum). Finally, Nishizawa’s (2024) study on authenticity in academic English listening tests from a fluency perspective also includes a set of benchmarks and underlying data that test developers can use to evaluate their own test’s content and inform test task specifications. All of these contributions promote efficiency, transparency, replicability, and communal efforts in research and practice, yielding benefits above and beyond studies of similar topics and scope that do not share materials openly.
Increased scrutability of analyses
All empirical studies featured in this Special Issue have openly shared data and analysis scripts. Perhaps one of the greatest benefits to the research community that OS brings is the ability to more thoroughly interrogate published findings through reanalysis of open data and code, responding to Byrnes’ (2013, p. 825) call for increased “professional scrutiny” of methodological issues in applied linguistics research. If there is anything pertaining to the data, analyses, and/or results in one of this issue’s empirical studies you have questions or doubts about as a reader, you can download the data and investigate. We can also say that reviewers and editors took advantage of this, which in turn led to analytical refinements and/or improvements to the usability of the analysis code in some cases; the Transparent Review pilot provides some record of this, too (discussed later in this section). This openness should serve to increase trust and make it more likely, compared with research that is less open, that potential errors are found and corrected, ideally before publication, or (should any still exist) afterwards.
Benefits for researcher training
The open data in these studies can also be used synthetically in future research, such as by combining data that were elicited using the same instruments (e.g., Isbell & Son, 2022). In fact, Pan and Marsden (2024) explicitly discuss this possibility. In addition, the availability of real-world data has advantages for education and training. When teaching quantitative methods, including analyses commonly used in language testing that require item-level data, it is both useful and engaging to have real-world data for students to learn and practice with. First, real-world data may have some messiness that simulated data lack, like missing data or a need for reformatting, which provides students with opportunities to develop practical data skills. Second, real-world data can often be linked to actual test forms, tasks, and items, allowing students to practice connecting statistical results, like difficulty and misfit, with test content. In this Special Issue, the data Ha et al. (2024) have shared provide an excellent opportunity for other researchers to learn and practice a random forest analysis, which is not common in language testing: By downloading their data, readers can attempt to conduct the same analyses and compare their results to those of the published article.
Guards against questionable research practices
Burton’s (2024) study deserves special recognition as a rare example of preregistration in language testing (and applied linguistics more broadly). Burton provided a link to a timestamped preregistration of his study’s research questions, hypotheses, and proposed methods, which allows readers to evaluate the extent to which he deviated from his original plans or hypotheses. With respect to the latter, preregistration is thought to be one of the most effective means available to guard against HARKing (hypothesizing after the results are known; Kerr, 1998), through which researchers craft just-so narratives rather than addressing results that conflict with what a motivating theory actually predicted, seriously undermining hypothetico-deductive confirmatory research (see Isbell et al., 2022 and Larsson et al., 2024, for more on the prevalence of this practice in applied linguistics). Hopefully, this will be a harbinger of further preregistration of research protocols in our field, including as Language Testing Registered Reports, a dedicated article type for study preregistration that offers authors an in-principle acceptance for a full-length paper to follow if they adhere to their plan or robustly justify any deviations (see Isaacs & Winke, 2024).
Piercing the veil of peer review
In this Special Issue, we have piloted a more open peer review process by implementing a version of Transparent Review. Namely, we asked the authors and all reviewers whether they wished to participate, and if all agreed, we proceeded to collate all peer reviews, editor decisions, and author responses into one document as an online supplement that accompanies an accepted, published article. As in traditional double-blind reviewing, authors and reviewers remained anonymous throughout the review process.
While it is too early to assess the impact of this pilot, in our view Transparent Review can provide a useful behind-the-scenes record of the peer review process that highlights the general rigor of the process. As readers can see, the peer commentary and author responses are extensive, and while the overall tone is constructive and supportive (a possible side-effect of knowing one’s correspondence will be publicly available), readers will also be able to see that participation in the transparent review process did not result in reviewers holding back critical comments (a concern often voiced by colleagues not yet convinced of more transparent peer review practices). It is also the case that peer commentary from one reviewer to another may not always be in perfect alignment, as some reviewers catch infelicitous or problematic aspects of a manuscript that others do not or simply have diverging views on various matters. In such cases, readers can see where the editor has weighed in and which advice the authors ultimately took up. Although we see many benefits of Transparent Review, there may also be some drawbacks or unintended consequences. For example, some authors or reviewers may not wish to participate in Transparent Review, which raises questions of the usefulness of Transparent Review if it is only done selectively and whether it might drive authors or reviewers away from a journal. Still, we believe that the largely positive experience of this issue’s pilot should motivate Language Testing to further explore ways to make the peer review process more transparent.
We encourage readers to examine the Transparent Review supplemental files; graduate students and early-career researchers may find them useful for better understanding how to provide constructive peer reviews and how to respond to reviewer comments. While peer reviewers’ identities remain anonymous, the authors’ and handling editor’s identities do not, which adds another degree of transparency when linked to Language Testing’s recently strengthened application of author conflict of interest (COI) disclosure and field-leading COI disclosure for all handling editors of manuscripts (Sage, 2023). As a field with substantial industry involvement (Isbell & Kim, 2023), language testing may be well-positioned to lead the way in peer review transparency in applied linguistics more broadly. Compared with other applied linguistics subdisciplines, the funding sources for language testing research, commercial stakes of many tests, and reputations of tests and test developers (commercial and non-profit alike) are more likely to result in conflicts of interest and thus create a stronger impetus for greater transparency in and scrutiny of research.
Collective innovations and improvements
On the topic of the language testing industry, we also believe that OS will likely yield benefits to test developers and ultimately have a positive impact on testing practice. LaFlair (2024) discussed the “business case” for OS for industry institutions and suggested that adopting OS practices (and participating in OS communities of practice) could help attract talent. As we noted earlier, language testers who submitted to this Special Issue tended to be early-career researchers and/or language assessment practitioners. Industry may indeed be seen as more attractive to the next generation of language testers if it can show, with concrete actions and tangible support, that it contributes to knowledge sharing and transparency.
Beyond attracting talent, engagement in OS practices is likely to lead to other tangible benefits. Nishizawa’s (2024) independently conducted research is immediately applicable to the IELTS and TOEFL iBT, and thanks to the open data and scripts, developers of other tests can integrate data from their own listening test passages to evaluate how their tests stack up. Consider also natural language processing (NLP) technology, which is frequently used to extract linguistic indices from text for the purpose of automated scoring in language tests. To name but one example, a recent study by Kyle and Eguchi (2024) found that training widely used NLP tools on second language (L2) data improved the tools’ performance when analyzing L2 sentences. Openly accessible, high-quality L2 language data (say, the kind that might be collected in highly standardized conditions from a diverse range of L1 backgrounds) could be used to generate better NLP tools that may lead to downstream benefits in terms of better scoring tools and better practice in operational language tests. In part, this may require industry organizations to think beyond immediate benefits to their proprietary products (e.g., tests, automated scoring systems) and orient not just toward research (more “basic” research less specific to particular tests is sometimes supported by these organizations) but also to resources that could potentially advance the field as a whole.
The next generation: Optimism for open science in language testing
Earlier, we noted that contributions to this Special Issue came mostly from early-career language testing researchers and researchers whose primary area of expertise lies outside of language testing. This is a reason to be optimistic about the future of OS in language testing, provided that we (continue to) support early-career testing researchers’ efforts and keep our disciplinary borders open. Among early-career testing researchers, Burton (2024) and Nishizawa (2024) each demonstrate commendable commitment to OS in language testing. Both authors made their datasets and analysis scripts open, and Burton’s study is notably one of just a handful of preregistered studies in the field. These two authors completed their PhDs within the last 2 years and, we hope, are a sign of things to come. It is equally encouraging to see engagement from SLA researchers in our field in this Special Issue, with senior, rising mid-career, and early-career researchers making valuable contributions. This is perhaps a product of the long-running dialogue between language testing and SLA, noted earlier, with SLA researchers increasingly seeing language testing and its academic journals as a home for scholarship on the development and validation of instruments used in research. SLA researchers have also been leading applied linguistics in meta-research and OS practices (e.g., Al-Hoorie et al., 2024; Al-Hoorie & Vitta, 2019; Liu et al., 2023; Marsden et al., 2018; Marsden & Morgan-Short, 2023), and a team of three such researchers (Liu et al., 2024) have provided language testing with a fresh (and much-needed, in our view) perspective on OS practices in our field. We expect that language testing will continue to attract researchers interested in meta-research and OS, and the recently introduced Systematic Review article type in Language Testing (Harding & Winke, 2022) and recently updated author guidelines will certainly be of good use for such research in the future.
Although more senior language testing researchers in academia did not contribute empirical articles (nor submit proposal abstracts) to the Special Issue, Winke’s (2024) Viewpoint and responses from academic researchers (some with current or past involvement with professional language testing organizations) around the world demonstrate a good deal of support for OS, despite some legitimate reservations (most strongly expressed by Chapelle & Ockey, 2024). It is especially encouraging to see discussion of national-level support for OS in different international settings, including China (Jin & Fan, 2024), Egypt (Gebril & Bali, 2024), and Japan (Koizumi et al., 2024).
Another source of optimism comes from professional language test developers outside of academia (“industry”). As recounted in several responses to Winke’s (2024) Viewpoint, major international testing organizations such as the British Council, Cambridge University Press and Assessment, Duolingo, and Educational Testing Service are (a) already engaging in some OS practices, like maintaining publicly accessible (with restrictions, in some cases) datasets, and (b) supportive of OS practices for internal and external research they fund. This is all despite the non-trivial costs of OS and concerns for test taker privacy and test security.
How to get ahead of the curve again
So far, we have pointed out several shortcomings regarding OS in language testing, provided a rationale for why OS is worth doing, and discussed some reasons for optimism. Continuing from optimism, we now wish to offer some constructive suggestions for fostering OS in language testing going forward.
As a field, we need greater support for equitable OA publishing. As this Special Issue demonstrates, conventional publishers with Hybrid OA publishing options (Language Testing and Language Assessment Quarterly; Liu et al., 2024) see only a fraction of research published openly, with researchers in the Global South even less likely to be able to publish OA. Language testing as a field, including professional organizations such as the International Language Testing Association (ILTA) and test developers that conduct and fund research, should aim to support OA publication infrastructure, beyond paying commercial publishers for individual Hybrid/Gold OA article publication fees. This could take the form of directly and financially supporting journals like Studies in Language Assessment, potentially launching and funding new Diamond OA journals, and/or trying to negotiate agreements with publishers of Hybrid OA journals to “flip” to a universal OA publication model, as was recently achieved for several journals in applied linguistics published by Cambridge University Press (e.g., ReCALL currently and Bilingualism: Language and Cognition in 2025).
Industry organizations should commit substantially to supporting other aspects of OS in language testing. Alongside incentivizing OS practices through incorporating them into funding calls and criteria, one way to do this, as highlighted in Winke’s (2024) Viewpoint and several responses, is to create and maintain high-quality collections of test data that are available openly or with reasonable (and ideally minimal) restrictions. Many organizations do this already, but we note that such practice does not appear common yet outside of larger developers of English language tests. Making researcher-friendly practice tests or special research forms is something we see emerging as an issue of critical importance for supporting high-quality, transparent research in language testing in the future. Traditionally, independent language testing researchers have been able to use official practice tests in research, a form of open materials that predates the OS movement. However, with the propagation of digital language tests, we see a major risk that academic and independent researchers will not have adequate access to digital practice tests for research purposes, a problem that seems within the power of test developers to solve.
Researchers themselves must play a key role in achieving greater uptake of OS practices. As seen in this Special Issue, junior/early-career researchers may be leading the way (see also, e.g., Hui & Huntley, 2020), and they should be supported and encouraged to continue sharing their data, analysis code, and materials as well as to do the hard work of preregistration. More established language testing researchers can positively influence junior researchers by setting examples of OS engagement themselves and by incorporating OS into courses they teach, including ethics training that addresses when OS practices should be avoided. Researchers should adopt a more incrementalist mindset to research in language testing, where each individual study is seen as but one data point to be (dis)confirmed and built on with future research, and in turn, that future research is clearly facilitated by access to materials, analysis code, and data from previous studies. With increased transparency and accessibility would come increased replicability (as procedures and analyses of previous studies could be more precisely followed), and replication studies should be more widely encouraged in general. Finally, as has been argued elsewhere (Al-Hoorie et al., 2024), OS is not an all-or-nothing endeavor, and doing what one can when it is feasible can still contribute to larger communal benefits as well as changes in a field’s research culture.
In closing, we argue that OS provides benefits to the language testing research community and beyond, increasing the utility and scrutability of research, and we hope this Special Issue illustrates and contributes to this. OS, including the relatively simple act of publishing open manuscripts, can help test developers and academic researchers alike share their work more effectively within and across discourse communities, including the public (Chapelle, 2021). Scrutability, achieved through practices such as preregistration, open data, open code, and open materials, is particularly important for advancing justice in the use of language tests (Kunnan, 2018), as researchers and the broader public can more effectively and thoroughly reason through available evidence when it is more comprehensively accessible. While we must ensure appropriate safeguards and restrictions are in place to protect the privacy of test takers and the security of test material, it is time for language testing to move forward, catch up with other disciplines, and, we hope, move to innovate and lead in OS.
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Daniel R. Isbell has recent or ongoing relationships involving research funding, honoraria, or consulting fees from the following organizations mentioned in this editorial: British Council, Cambridge University Press and Assessment, Duolingo, Educational Testing Service, and IELTS Consortium.
Benjamin Kremmel is the Book Review Editor of Language Testing and served on the Editorial Advisory Board of Language Assessment Quarterly from 2018 to 2023. He has recent or ongoing relationships involving research funding, honoraria, or consulting fees from the following organizations mentioned in this editorial: British Council, Cambridge University Press and Assessment, Educational Testing Service, and IELTS Consortium.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
