Harnessing AI for Educational Measurement: Standards and Emerging Frontiers

Abstract

The surge of AI in education has raised concerns about its potential downsides for educational measurement, and calls for clear standards are warranted. Fortunately, the psychometrics field has a long history of developing relevant standards, such as sample invariance and the avoidance of item bias, that are crucial for reliable, valid, and interpretable assessments. This established body of knowledge should guide the development of AI assessments, much as traffic laws govern self-driving cars. Measuring new constructs necessitates stronger construct validity research. Instead of rewriting the rulebook, our focus should be on educating AI developers about these standards. This commentary focuses not on high-stakes testing but on empowering instructors to write effective items with AI. We explore the potential of AI to transform item development, a key area highlighted by researchers. While AI tools offer exciting possibilities for tackling educational challenges, equipping instructors to leverage them effectively remains paramount.

Keywords

AI-powered psychometric tools, educational measurement, standards of testing

Introduction

The rapid rise of AI in education has sparked concerns about potential downsides in educational measurement, which in turn has led to calls for establishing clear standards. Fortunately, the psychometrics community has a rich history of developing standards, many of which remain relevant. Instead of reinventing the wheel, our focus should be on educating AI developers about existing standards. Just as self-driving cars must adhere to traffic laws, AI assessment tools should follow established principles such as sample invariance and freedom from item and test bias. Doing so helps ensure the reliability, validity, and interpretability of results. Established in 1966, the Standards for Educational and Psychological Testing has evolved through revisions, with the 2014 edition being the most recent (American Educational Research Association et al., 2014). It continues to be a widely respected guide for ethical and effective practices in educational and psychological testing.

The emergence of new technologies allows us to assess previously elusive constructs, like response time, and capture a wider range of data points during assessments. This underscores the enduring importance of construct validity. As Classical Test Theory emphasizes (Allen & Yen, 1979), measuring new constructs necessitates ongoing construct validity studies. These studies are crucial for informing the development of new standards specifically tailored to AI-based assessments. Recognizing this need, the National Council on Measurement in Education (NCME) has partnered with the American Psychological Association (APA) and the American Educational Research Association (AERA) to develop a new version of the Standards for Educational and Psychological Testing.

Modern measurement theories, as explored by Chang et al. (2021), are transforming assessment from static ranking methods to dynamic instruments that provide richer insights for all stakeholders. An increasing number of AI-driven assessment tools, such as computerized adaptive testing (CAT) and cognitive diagnostic modeling (CDM), have emerged, demonstrating potential to address enduring challenges in contemporary teaching and learning. However, to fully harness this potential, we need clear guidelines for using these tools effectively. Refining assessment standards is critical, and this involves two key aspects. First, it is essential to leverage existing standards to ensure a smooth transition. Second, new standards are needed that specifically address potential biases and promote fair learning and teaching environments.

This commentary expands on my presentation at the 2024 NCME Annual Meeting, “Sparking a Debate on the Role of Artificial Intelligence in Educational Measurement,” organized by Professor Steven Culpepper. Because current standards for educational assessment lack guidance on AI use, experts are now reviewing and debating how to incorporate AI standards into existing frameworks. For example, the Duolingo English Test applies Responsible AI Standards to leverage AI while upholding the importance of human expertise for dependable, secure, and streamlined assessment processes. AI is employed for test design, scoring, and security measures, complemented by human oversight to guarantee fairness and precision (Dieterle, 2024).

Instead of high-stakes testing, my focus is on how to empower instructors with AI for effective item writing and development. This commentary delves into the fourth point raised by researchers: how AI can transform item writing and development. While AI-powered tools offer exciting potential to address current teaching and learning challenges, a critical element lies in equipping instructors to leverage them effectively. This requires training item writers in utilizing AI-powered psychometric tools for:

Intelligent item banking: By analyzing student responses, AI can calibrate item difficulty and discrimination power, ensuring consistent parameters across diverse groups.

Validating the Q-matrix: Cognitive diagnostic tools powered by AI assist in validating pre-established Q-matrices, enabling accurate insights into students’ strengths and weaknesses based on their responses.

Detecting bias and fairness: AI can identify potential biases in items based on language, cultural references, or other factors, leading to fairer assessments.
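
To make the first of these tasks concrete, the following is a minimal sketch of a classical item analysis that an instructor-facing tool might run on scored responses. It is an illustration under stated assumptions rather than a method prescribed by this commentary: the function name item_statistics is hypothetical, responses are assumed to be scored 0/1, difficulty is taken as the proportion correct, and discrimination is taken as the corrected point-biserial correlation between an item and the total score on the remaining items.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """Return a (difficulty, discrimination) pair for each item.

    responses: list of rows, one per student, each a list of 0/1 item scores.
    Difficulty is the proportion of students answering correctly.
    Discrimination is the corrected point-biserial: the correlation between
    the item score and the total score on all *other* items.
    (Hypothetical helper, written for illustration only.)
    """
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            # No variance: the correlation is undefined for this item.
            discrimination = float("nan")
        else:
            m_item, m_rest = mean(item), mean(rest)
            cov = mean((x - m_item) * (y - m_rest) for x, y in zip(item, rest))
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats


# Toy data: four students, three items (invented for illustration).
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for index, (p, r) in enumerate(item_statistics(scores)):
    print(f"item {index}: difficulty={p:.2f}, discrimination={r:.2f}")
```

Running the same function separately on each student subgroup and comparing the results is one simple way to check whether item parameters stay consistent across groups, in the spirit of sample invariance.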

Rethinking STEAM Education: Can AI Address Traditional Teaching Challenges?

Today’s teaching and learning environment faces numerous challenges. Consider introductory STEM courses at U.S. universities. Here, large, impersonal lecture courses often subject hundreds or even thousands of freshmen to a “one-size-fits-all” approach. This lack of personalized attention, frequently compounded by inadequate teaching support such as teaching assistants or graders, leads to alarmingly high DFW (D, Fail, or Withdraw) rates of 30% to 50%. The situation is even more concerning for underrepresented minority (URM) groups. Studies, such as a recent one from a midwestern university, show DFW rates reaching 66% in crucial math courses like Algebra and Trigonometry. These “gateway” courses are prerequisites for many majors, and failing them forces URM students to switch majors, adding significant time and financial strain. To make matters worse, such large class sizes create a significant hurdle for instructors when designing and effectively grading assessments.

While research suggests that individualized instruction in small classes leads to better learning outcomes (Donnelly et al., 2015; Gerard et al., 2019), logistical constraints often prevent large universities from offering such settings. Generative AI presents a potential solution for personalized learning. However, current limitations restrict its ability to deliver fully individualized evaluations. As Zheng (2024) argues, generative AI needs to find a balance between providing detailed explanations and remaining concise for effective learner support. Large language models, like GPTs, require further development before they can offer truly tailored assessments and feedback. Therefore, traditional AI applications remain crucial in supporting educators until generative AI matures enough to fully automate classroom instruction. Traditional AI encompasses various tools that simulate human intelligence to assist with tasks like adaptive testing, identifying item bias, and using natural language processing for automated analysis.

The Rise of AI-Assisted Item Writing and the Need for Trained Instructors

Research suggests that established psychometric tools like CAT and CDM can personalize learning by tailoring assessments to individual student strengths and weaknesses. These tools, often categorized as traditional AI, demonstrably improve learning by providing tailored feedback and learning pathways. Notably, combining CAT and CDM into cognitive diagnostic computerized adaptive testing (CD-CAT) creates a powerful tutoring system that pinpoints a student’s mastered skills and areas needing improvement (Le et al., 2024; Liu et al., 2013, 2014; Morphew et al., 2018).
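
The adaptive step at the heart of CAT can be sketched in a few lines. The example below is a simplified illustration, not the CD-CAT procedure of the studies cited above: it assumes a two-parameter logistic (2PL) item response model and the common maximum-information selection rule, and the function names and the small item bank are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Pick the unused item that is most informative at the current
    ability estimate (maximum-information rule).

    item_bank: dict mapping item id -> (a, b); administered: set of item ids.
    """
    candidates = [item for item in item_bank if item not in administered]
    return max(candidates, key=lambda item: item_information(theta, *item_bank[item]))


# Hypothetical three-item bank: (discrimination, difficulty).
bank = {"easy": (1.0, -2.0), "medium": (1.0, 0.0), "hard": (1.0, 2.0)}
print(select_next_item(0.0, bank, administered=set()))       # prints "medium"
print(select_next_item(1.5, bank, administered={"medium"}))  # prints "hard"
```

An operational CAT would also re-estimate ability after each response and typically add content-balancing and item-exposure controls; those steps are omitted here to keep the selection logic visible.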

While teachers remain the cornerstone of assessment item creation, the rise of AI-driven item-writing tools makes it necessary to train item writers to leverage this technology effectively. In addition, the integration of established traditional AI methods such as automated test assembly (including CAT) necessitates the development of corresponding standards for classroom assessments. Research by Liu et al. (2013, 2014) demonstrates that item writers who receive basic psychometrics training create higher-quality items for large-scale assessments than those without such training.

Fair and accurate STEM assessments hinge on minimizing differential item functioning (DIF). DIF occurs when students with similar abilities but from different backgrounds (e.g., ethnicity or gender) perform differently on an item. These differences are due to external factors, not the intended skill (Chang et al., 1996; Cheng et al., 2013). A central challenge in DIF studies is establishing a fair comparison point for students across groups, either within the test or externally; this is crucial for identifying DIF items. Although DIF analysis is mandatory for public and licensure exams in the United States, college course exams currently lack such protocols. Yet DIF can exist in gateway STEM courses. For instance, physics items referencing military equipment showed DIF favoring men (Traxler et al., 2018).
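
As an illustration of how DIF screening compares students of similar ability, the sketch below computes the Mantel-Haenszel common odds ratio for a single item, with students matched on a score such as the total test score. Mantel-Haenszel is a widely used DIF screening method and is shown here only as an example; it is not the SIBTEST-based procedure of Chang et al. (1996) or the equated pooled booklet method of Cheng et al. (2013). The function name, the group labels "ref" and "focal", and the record layout are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """Return (alpha_MH, MH D-DIF) for one studied item.

    records: iterable of (group, matching_score, correct), where group is
    "ref" or "focal", matching_score is the ability proxy used to match
    students (e.g., total score), and correct is 0 or 1.
    """
    # Per matching-score stratum, count:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    if numerator == 0 or denominator == 0:
        # Too little data to estimate an odds ratio.
        return float("nan"), float("nan")

    alpha = numerator / denominator
    # ETS delta scale: negative values mean the item favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Invented data: within one score stratum, the reference group answers
# correctly more often than the focal group.
data = (
    [("ref", 1, 1)] * 8 + [("ref", 1, 0)] * 2
    + [("focal", 1, 1)] * 5 + [("focal", 1, 0)] * 5
)
alpha, delta = mantel_haenszel_dif(data)
print(f"alpha_MH={alpha:.2f}, MH D-DIF={delta:.2f}")
```

An odds ratio near 1 (a delta near 0) suggests little DIF on the item. In a classroom setting, small group sizes make these estimates noisy, so flagged items are best treated as prompts for expert review rather than as verdicts.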

Applying DIF analysis to classroom assessment requires careful consideration of how DIF arises in this environment and the development of appropriate methods and software. Regular DIF screening can pinpoint items that might be biased against URM groups. Addressing these flagged items can not only raise awareness but also refine item-writing skills. Therefore, establishing protocols for instructors with diverse backgrounds is crucial for test validity. Efforts to minimize item and test bias are essential to ensure fair and individualized assessments that enhance student understanding, encourage exam participation, and provide valuable diagnostic information. This can empower all students, particularly URM students and women (URM&W). Evidently, this research has significant implications for educational policy.

Conclusion

AI is transforming educational testing. Psychometric tools like CAT (adapting difficulty) and E-Rater (automated scoring) streamline assessments, while bias detection (DIF items) and detailed learning profiles (cognitive diagnostic models) offer deeper insights. However, established standards ensure quality in high-stakes testing, a framework currently lacking in classroom assessments. Integrating and expanding these guidelines can equip instructors with the skills to leverage AI effectively in item creation, leading to efficient, accurate, and fair classroom assessments.

The field of educational measurement, guided by the Standards established in 1966 (revised 2014), is confronting fresh challenges amid the rapid advancement of AI. Concerns regarding its potential drawbacks for measurement have arisen, leading to calls for revising pertinent standards to ensure responsible and effective AI utilization in educational testing. At the same time, contemporary measurement theories are transforming testing from a static ranking system into a dynamic and informative tool that better caters to the diverse needs of education stakeholders. Deliberations are ongoing on harnessing AI to augment existing psychometric tools such as CAT and CDM. This integration holds promise for further personalizing learning experiences (e.g., Chang, 2015; Chang et al., 2021). It is evident that the new standards should give special consideration to classroom assessment.

Footnotes

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Hua-Hua Chang

Author

HUA-HUA CHANG is the Charles R. Hicks Chair Professor in the Department of Educational Studies at Purdue University, Beering Hall, Room 5116, 100 N University Street, West Lafayette, IN 47907; e-mail: Chang606@Purdue.edu. He focuses on personalizing learning experiences through the use of Computerized Adaptive Testing (CAT).

References

Allen, M. A., & Yen, W. M. (1979). Introduction to measurement theory. Reprint by Waveland Press, Dec 14, 2001.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). The standards for educational and psychological testing (2014 ed.).

Chang, H.-H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80(1), 1–20.

Chang, H.-H., Mazzeo, & Roussos (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33(3), 333–353.

Chang, H.-H., Wang, & Zhang (2021). Statistical applications in educational measurement. Annual Review of Statistics and Its Application, 8, 439–461.

Cheng, Chen, P.-H., Qian, J.-H., & Chang, H.-H. (2013). Equated pooled booklet method in DIF testing. Applied Psychological Measurement, 37(4), 276–288.

Dieterle (March 6, 2024). AI and emerging technology ambassadors in education: researchers lead the way. https://blog.englishtest.duolingo.com/ai-and-emerging-technology-ambassadors-in-education-researchers-lead-the-way/#:~:text=Leveraging%20AI%20for%20secure%2C%20efficient,oversight%20ensures%20fairness%20and%20accuracy

Donnelly, D. F., Vitale, J. M., & Linn, M. C. (2015). Automated guidance for thermodynamics essays: Critiquing versus revisiting. Journal of Science Education and Technology, 24(6), 861–874.

Gerard, Kidron, & Linn, M. C. (2019). Guiding collaborative revision of science explanations. International Journal of Computer-Supported Collaborative Learning, 14(3), 291–324.

Nissen, Tang, Zhang, Mehrabi, Chang, & Dusen (under review). Assessing the assessments with Mechanics Cognitive Diagnostic: Skills tested in introductory physics courses.

Le, P. V., Nissen, J. M., Tang, Zhang, Mehrabi, Morphew, Chang, H.-H., & Van Dusen (2024). Assessing the assessments with mechanics cognitive diagnostic: Skills tested in introductory physics courses. PsyArXiv. https://arxiv.org/abs/2404.00009

Liu, You, Wang, Ding, & Chang, H.-H. (2013). The development of computerized adaptive testing with cognitive diagnosis for an English achievement test in China. Journal of Classification, 30, 152–172.

Liu, You, Wang, Ding, & Chang, H.-H. (2014). Large-scale implementation of computerized adaptive testing with cognitive diagnosis in China. In Cheng & Chang, H.-H. (Eds.), Advanced methodologies to support both summative and formative assessments (pp. 245–261). Information Age Publisher Inc.

Morphew, Mestre, Kang, Chang, H.-H., & Fabry (2018). Using computer adaptive testing to assess physics proficiency and improve exam performance. Physical Review Physics Education Research, 020110-1-202110-16. https://doi.org/10.1103/PhysRevPhysEducRes.14.010127

Traxler, Henderson, Stewart, Papak, & Lindell (2018). Gender fairness within the force concept inventory. Physical Review Physics Education Research, 14(1), 010103.

Zheng (February 2024). Psychometrics empowering large language models in Chinese essay automated scoring. Invited presentation #2 in Craft Data Science Insights: A 40 Minute Exploration. College of Education at Purdue University. https://youtu.be/2AmBLN0C5m8?si=yS9ePFgxtYyrc2nX