Evaluating LLM-Based Coders in Psychological Assessment: A Validation Framework With Application to the Rorschach Morbid Content Variable

Abstract

Large language models (LLMs) are increasingly used to support psychological assessment, but standards for evaluating their scoring accuracy remain limited. This article introduces a clear, reproducible validation framework for evaluating LLM-based scoring systems. The framework separates pre-validation steps (e.g., balancing base rates, refining prompts, and comparing models) from a standardized validation phase focused on reliability and validity benchmarks. We demonstrate its application with a case study of Morbid Content (MOR) scoring in the Rorschach task, using a two-agent LLM workflow. In an independent dataset (n = 84; 2,176 responses) with natural MOR base rates, the final LLM coder showed good response-level agreement (kappa = .72–.74) and excellent protocol-level agreement (ICC = .94–.95) with assessors, near-perfect consistency with itself (ICC = .97–.99), and replicated external validity (r = .59–.71) that matched human coders (r = .54–.65). This article offers a practical guide for evaluating automated coders in psychological testing and discusses practical decisions and ethical considerations.
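
The two levels of agreement reported above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: Cohen's kappa stands in for the response-level statistic, a two-way absolute-agreement single-rater ICC (McGraw and Wong's ICC(A,1)) stands in for the protocol-level statistic, and protocol scores are taken to be the sum of binary MOR codes per protocol. All function names and data below are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical codes (assumed variant).

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    p_chance = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)


def protocol_totals(protocol_ids, codes):
    """Sum response-level codes within each protocol (assumed aggregation)."""
    totals = defaultdict(int)
    for pid, code in zip(protocol_ids, codes):
        totals[pid] += code
    return [totals[pid] for pid in sorted(totals)]


def icc_a1(x, y):
    """Two-way, absolute-agreement, single-rater ICC for two raters."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(xi + yi) / k for xi, yi in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in list(x) + list(y))
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Made-up toy data: six responses across three protocols, coded 1 = MOR.
protocols = ["p1", "p1", "p2", "p2", "p3", "p3"]
human = [1, 0, 0, 0, 1, 1]
llm = [1, 0, 0, 1, 1, 1]

print("response-level kappa:", cohen_kappa(human, llm))
print(
    "protocol-level ICC:",
    icc_a1(protocol_totals(protocols, human), protocol_totals(protocols, llm)),
)
```

Keeping the two levels separate matters because a coder can disagree on individual responses while still producing similar protocol totals, so the two statistics answer different questions.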

Keywords

psychological assessment, Rorschach, morbid content, psychometrics, automated scoring, validation framework


