Abstract
Large language models (LLMs) are increasingly used to support psychological assessment, but standards for evaluating their scoring accuracy remain limited. This article introduces a reproducible validation framework for evaluating LLM-based scoring systems. The framework separates pre-validation steps (e.g., balancing base rates, refining prompts, and comparing models) from a standardized validation phase focused on reliability and validity benchmarks. We demonstrate its application with a case study of Morbid Content (MOR) scoring in the Rorschach task, using a two-agent LLM workflow. In an independent dataset (n = 84; 2,176 responses) with natural MOR base rates, the final LLM coder showed good response-level agreement (kappa = .72–.74) and excellent protocol-level agreement (ICC = .94–.95) with assessors, near-perfect consistency with itself (ICC = .97–.99), and replicated external validity (r = .59–.71) that matched that of human coders (r = .54–.65). This article offers a practical guide for evaluating automated coders in psychological testing and discusses implementation decisions and ethical considerations.
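For readers unfamiliar with the response-level agreement statistic reported above, the following is a minimal sketch of Cohen's kappa for two coders. The function name `cohen_kappa` and the example code lists are hypothetical illustrations, not the article's data or implementation; the codes are assumed to be binary (1 = MOR scored, 0 = not scored).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same responses.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of responses.")
    n = len(rater_a)

    # Observed proportion of responses on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal code frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up codes for ten responses (assumption: 1 = MOR present, 0 = absent).
human_codes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
llm_codes = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
kappa = cohen_kappa(human_codes, llm_codes)
```

Because kappa corrects for chance agreement, it is sensitive to base rates, which is one reason agreement on a sample with natural MOR base rates is informative.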
