Abstract
Large language models (LLMs) can achieve passing scores in specialist-level examinations, yet their capacity to author high-stakes examination content remains under-explored. Compared with published human benchmarks, LLMs create questions roughly 10 times faster, and contextual memory across sessions enables rapid diversification of topic coverage. Prompt design therefore emerges as an academic, not merely technical, craft. This article synthesises the emerging literature on artificial intelligence-assisted item writing and illustrates, through stepwise experimentation with OpenAI GPT-4o, how deliberate prompt engineering can align LLM output with the high standards of medical examinations. Key strategies explored include defining clinical context, imposing structural constraints, supplying exemplar items, assigning examiner roles, sequencing chain-of-thought instructions and requesting rationales. In a worked example, these approaches are layered sequentially while generating short-answer questions mapped to Australian and New Zealand College of Anaesthetists curriculum statements. An evidence-based approach to LLM use for question generation could markedly reduce examiner workload, provide educational integrity and enrich item banks.
Keywords
Introduction
Large language models (LLMs), a form of artificial intelligence (AI), are being rapidly adapted for use in various medical fields, with a recent application in the improvement of patient education materials. 1 Medical education is facing increasing demand for efficient and high-quality assessments, driven by workforce pressures and growing candidate numbers. 2 Traditional question-writing processes are time-consuming and labour-intensive with AI tools having the potential to reduce examiner workload and enhance the quality of educational programmes. A critical consideration for LLM utility is ‘prompt engineering’, the process by which users construct inputs to guide LLM outputs. Effective prompt design is essential to ensure accuracy, relevance and alignment with curricular objectives. 3
Our aim in this article is to synthesise current evidence in prompt design for medical education and propose practical, literature-informed strategies for prompting LLMs to generate constructed response questions (CRQs) for medical examinations.
Methods
We performed a two‑phase mixed‑methods study comprising: 1) a narrative review of published guidance on AI‑assisted item generation in health education, and 2) an experimental prompt engineering exercise using OpenAI Generative Pre-trained Transformer 4 Omni (GPT‑4o) to craft CRQs aligned with the Australian and New Zealand College of Anaesthetists (ANZCA) Final Examination curriculum. All work was completed between 3 July and 31 July 2025.
Narrative review
A literature search of MEDLINE, Embase and ERIC was undertaken in July 2025 using the string (“large language model” OR “ChatGPT” OR “generative AI”) AND (“examination”) AND (“prompting”) AND (“medical”). Articles describing empirical use of LLMs for generating assessment items in any health professional context were included. The review was undertaken by the primary author, including published abstracts and full text. Results were analysed on study setting, LLM version, prompt strategies and evaluation of LLM outputs.
Prompt engineering protocol
A six‑layer prompt engineering framework (Table 1) was constructed a priori from the literature review. All prompts were executed in a new GPT‑4o session with temperature 0.7, maximum tokens 1024 and default system instructions disabled to minimise hidden bias or contextual memory. These parameters were selected to balance creativity and accuracy, while ensuring reproducibility for large scale implementation.
Generative pre-training Transformer 4 Omni outputs based on prompts provided for short answer question generation.
Refer to Supplemental material Table S2 online for samples of each prompt (1–6).
GPT-4o: Generative Pre-trained Transformer 4 Omni; COPD: chronic obstructive pulmonary disease; BMI: body mass index (kg/m2)
Ethical and artificial intelligence disclosure statement
No human or patient data were involved; institutional ethics review was therefore waived. AI was used to undertake the study design with prompt output as the primary project outcome. ChatGPT-4o was used to assist with readability and clarity. The authors have reviewed and take responsibility for the work.
Results
Narrative review output—prompting LLMs to generate examination questions
We derived six major principles for prompt engineering through thematic synthesis of recommendations identified in the narrative review. Recurrent strategies were grouped and refined into six final principles, through author consensus, which represent key strategies for maximising the effectiveness of LLMs in medical assessment design4–6 (Supplemental material Table S1 online).
Context
Establishing the context of the question, including clinical domain, examination style, and curricular alignment, is the foundation for successful prompt design. Prompting LLMs to specific examination formats, for the varied postgraduate medical examinations, can result in outputs that closely mimic the expected style and complexity of an ideal question for the specific examination.
Constraints and format guidance
Constraining the format and length of generated content ensures consistency with exam requirements. This may include instructing the LLM to produce a short answer question (SAQ) with a clinical vignette, followed by specific marking criteria.
Example-based prompting
Providing examples within the prompt, also known as few-shot learning, significantly enhances output quality. Including in the prompt a sample question that mirrors an ideal examination question style can help the LLM to mimic key features of question writing required for repeated application.
Role assignment
Role assignment refers to defining the identity of the LLM as a specific persona in order to improve focused question design. Prompting the model to adopt a persona (expert, educator or examiner, e.g. senior anaesthetist) leads to more disciplined and contextually appropriate responses.
Promoting reasoning and justification
Prompting the model to provide rationales or justification for its outputs enhances transparency and authenticity. By providing this additional information, educators are able to verify alignment with current clinical practice, literature and intended learning outcomes.
Structure
Finally, structured prompting using chain-of-thought techniques refers to providing the prompt in logical progression. Stepwise templates guide the model through sequential components of question generation, such as beginning with defining the clinical context, then incorporating constraints, then providing examples, in order to gradually refine the desired output.
GPT-4o prompts for ANZCA Final Examination questions
To demonstrate the impact of using structured prompt engineering on AI-assisted question generation, we applied these principles using GPT-4o to develop SAQs relevant to the ANZCA Final Examination. Each successive prompt layer added a new design element, beginning with a general clinical context (Prompt 1), followed by explicit curricular alignment (Prompt 2), structural constraints (Prompt 3), sample exam questions (Prompt 4), role assignment (Prompt 5) and finally a request for reasoning and justification (Prompt 6) (Supplemental material Table S2).
The evolution of the AI-generated content demonstrates a clear requirement for thorough prompt engineering (Table 1). Early prompts yielded longer, multi-part clinical scenarios, with additional prompting shifting the question design to a concise, dual-component question with explicit mark weightings, mirroring the structure seen in official ANZCA Final Examination reports.
Furthermore, we observed the model retained and built upon previous input with improved question generation earlier in the process when the same prompts were run with different question domains (see Supplemental material). This suggests that prompt layering not only refines individual responses but also improves subsequent content generation.
Discussion
This article highlights that prompt design is not merely a technical task but a pedagogical exercise, requiring educators to articulate context, intent, structure, role assignment and expectations with precision. While LLMs offer powerful capabilities to support assessment development, their successful integration into high-stakes contexts, such as the ANZCA Final Examination, demands ongoing validation and a system of expert review. Rather than replacing human expertise, LLMs should be viewed as tools that can enhance the efficiency, diversity and academic value of examination content. An understanding of the appropriate use of these tools and their limitations, and training educators on appropriate application, will help realise the full potential of AI-supported assessment in health professional education.
Supplemental Material
sj-docx-1-aic-10.1177_0310057X261453395 – Supplemental material for Large language model prompt engineering for medical education: A practical guide for the Australian and New Zealand College of Anaesthetists Final Examination
Supplemental material, sj-docx-1-aic-10.1177_0310057X261453395 for Large language model prompt engineering for medical education: A practical guide for the Australian and New Zealand College of Anaesthetists Final Examination by Timothy J. Trewren, Galina Gheihman, Kelly Bratkovic, D-Yin Lin, Stewart Anderson, Dario Winterton, Christina Gao and Brandon Stretton in Anaesthesia and Intensive Care
Footnotes
Author contributions
Artificial intelligence statement
GPT-4o was used to undertake the study design with prompt output as the primary project outcome. ChatGPT was used to assist with readability and clarity. The authors have reviewed and take responsibility for the work.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethical considerations
Ethical approval was not required for this study since it involved the interrogation of a large language model with investigator-created content.
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
