Large language model prompt engineering for medical education: A practical guide for the Australian and New Zealand College of Anaesthetists Final Examination

Abstract

Large language models (LLMs) can achieve passing scores in specialist-level examinations, yet their capacity to author high-stakes examination content remains under-explored. Compared with published human benchmarks, LLMs create questions roughly 10 times faster, and contextual memory across sessions enables rapid diversification of topic coverage. Prompt design therefore emerges as an academic, not merely technical, craft. This article synthesises the emerging literature on artificial intelligence-assisted item writing and illustrates, through stepwise experimentation with OpenAI GPT-4o, how deliberate prompt engineering can align LLM output with the high standards of medical examinations. Key strategies explored include defining clinical context, imposing structural constraints, supplying exemplar items, assigning examiner roles, sequencing chain-of-thought instructions and requesting rationales. In a worked example, these approaches are layered sequentially while generating short-answer questions mapped to Australian and New Zealand College of Anaesthetists curriculum statements. An evidence-based approach to LLM use for question generation could markedly reduce examiner workload, provide educational integrity and enrich item banks.

Keywords

Large language models prompt engineering question generation medical education medical examination artificial intelligence

Introduction

Large language models (LLMs), a form of artificial intelligence (AI), are being rapidly adapted for use in various medical fields, with a recent application in the improvement of patient education materials.¹ Medical education is facing increasing demand for efficient and high-quality assessments, driven by workforce pressures and growing candidate numbers.² Traditional question-writing processes are time-consuming and labour-intensive with AI tools having the potential to reduce examiner workload and enhance the quality of educational programmes. A critical consideration for LLM utility is ‘prompt engineering’, the process by which users construct inputs to guide LLM outputs. Effective prompt design is essential to ensure accuracy, relevance and alignment with curricular objectives.³

Our aim in this article is to synthesise current evidence in prompt design for medical education and propose practical, literature-informed strategies for prompting LLMs to generate constructed response questions (CRQs) for medical examinations.

Methods

We performed a two‑phase mixed‑methods study comprising: 1) a narrative review of published guidance on AI‑assisted item generation in health education, and 2) an experimental prompt engineering exercise using OpenAI Generative Pre-trained Transformer 4 Omni (GPT‑4o) to craft CRQs aligned with the Australian and New Zealand College of Anaesthetists (ANZCA) Final Examination curriculum. All work was completed between 3 July and 31 July 2025.

Narrative review

A literature search of MEDLINE, Embase and ERIC was undertaken in July 2025 using the string (“large language model” OR “ChatGPT” OR “generative AI”) AND (“examination”) AND (“prompting”) AND (“medical”). Articles describing empirical use of LLMs for generating assessment items in any health professional context were included. The review was undertaken by the primary author, including published abstracts and full text. Results were analysed on study setting, LLM version, prompt strategies and evaluation of LLM outputs.

Prompt engineering protocol

A six‑layer prompt engineering framework (Table 1) was constructed a priori from the literature review. All prompts were executed in a new GPT‑4o session with temperature 0.7, maximum tokens 1024 and default system instructions disabled to minimise hidden bias or contextual memory. These parameters were selected to balance creativity and accuracy, while ensuring reproducibility for large scale implementation.

Table 1.

Generative pre-training Transformer 4 Omni outputs based on prompts provided for short answer question generation.

Prompt design	GPT-4o generated question
Prompt number 1	You are the anaesthetist for a 65-year-old male with COPD and a 50-pack-year smoking history who is scheduled for a right upper lobectomy via video-assisted thoracoscopic surgery (VATS) for non-small cell lung cancer. a) List four preoperative assessments or investigations specific to thoracic surgery you would consider to evaluate this patient’s perioperative risk. (Four marks) b) Describe the key considerations for one-lung ventilation (OLV) in this patient, including your approach to lung isolation and intraoperative management. (Six marks) c) Outline your postoperative concerns and management priorities for this patient in the first 24 hours. (Five marks) d) What are two complications specific to VATS lobectomy that differ from open thoracotomy? (Two marks) Total: 17 marks
Prompt numbers 1 and 2	You are providing anaesthesia for a 59-year-old female undergoing a left lower lobectomy for bronchiectasis via thoracotomy. She has a BMI of 35 and a history of type 2 diabetes and osteoarthritis. a) Outline three techniques you would use to position the patient for a posterolateral thoracotomy and minimise risk of position-related injury. (Three marks) b) List four complications that may arise from the lateral decubitus position, and explain how you would mitigate each. (Four marks) c) What are the indications for one-lung ventilation (OLV) in this case? (Two marks) d) List four absolute or relative contraindications to OLV and briefly explain why they pose concern. (Four marks) e) Describe how you would confirm and optimise lung isolation after placement of a left-sided double lumen tube. (Four marks) Total: 17 marks
Prompt numbers 1 to 3	A 62-year-old male is scheduled for a right upper lobectomy via thoracotomy. a) Describe the key principles of patient positioning for thoracic surgery and strategies to minimise positioning-related injury. (50%) b) Outline the indications and contraindications for one-lung ventilation in this patient. (50%)
Prompt numbers 1 to 4	Describe the techniques used to position a patient for thoracic surgery and strategies to minimise the risk of postoperative position-related injury. (50%) Discuss the indications and contraindications for one-lung ventilation in thoracic surgery. (50%)
Prompt numbers 1 to 5	Describe the techniques used to position a patient for thoracotomy and the strategies employed to minimise the risk of position-related injury. (50%) Discuss the indications and contraindications for one-lung ventilation in thoracic surgery. (50%)
Prompt numbers 1 to 6	Describe the techniques used to position a patient for lateral thoracotomy and strategies to minimise the risk of postoperative position-related injury. (50%) Discuss the indications and contraindications for one-lung ventilation (OLV) in thoracic surgery. (50%) This question directly assesses applied knowledge expected of a Day 1 consultant, integrating anatomical, physiological and procedural understanding relevant to thoracic anaesthesia. Clinical relevance: Thoracic surgery presents unique anaesthetic challenges that are high-stakes and routine for consultant anaesthetists. The lateral decubitus position, required for thoracotomy, carries well-known risks including brachial plexus injury, dependent lung atelectasis, pressure injuries and ocular complications. Awareness and mitigation strategies are vital, particularly in elderly or comorbid patients. OLV is essential in thoracic anaesthesia for surgical exposure and lung protection. However, its use carries significant risks in patients with limited pulmonary reserve or right-sided cardiac pathology. Consultant-level knowledge must include patient selection, contraindications (e.g. severe contralateral lung disease) and troubleshooting hypoxaemia. This knowledge underpins critical preoperative decision-making, intraoperative safety and multidisciplinary communication.

Refer to Supplemental material Table S2 online for samples of each prompt (1–6).

GPT-4o: Generative Pre-trained Transformer 4 Omni; COPD: chronic obstructive pulmonary disease; BMI: body mass index (kg/m²)

Ethical and artificial intelligence disclosure statement

No human or patient data were involved; institutional ethics review was therefore waived. AI was used to undertake the study design with prompt output as the primary project outcome. ChatGPT-4o was used to assist with readability and clarity. The authors have reviewed and take responsibility for the work.

Results

Narrative review output—prompting LLMs to generate examination questions

We derived six major principles for prompt engineering through thematic synthesis of recommendations identified in the narrative review. Recurrent strategies were grouped and refined into six final principles, through author consensus, which represent key strategies for maximising the effectiveness of LLMs in medical assessment design^4–6 (Supplemental material Table S1 online).

Context

Establishing the context of the question, including clinical domain, examination style, and curricular alignment, is the foundation for successful prompt design. Prompting LLMs to specific examination formats, for the varied postgraduate medical examinations, can result in outputs that closely mimic the expected style and complexity of an ideal question for the specific examination.

Constraints and format guidance

Constraining the format and length of generated content ensures consistency with exam requirements. This may include instructing the LLM to produce a short answer question (SAQ) with a clinical vignette, followed by specific marking criteria.

Example-based prompting

Providing examples within the prompt, also known as few-shot learning, significantly enhances output quality. Including in the prompt a sample question that mirrors an ideal examination question style can help the LLM to mimic key features of question writing required for repeated application.

Role assignment

Role assignment refers to defining the identity of the LLM as a specific persona in order to improve focused question design. Prompting the model to adopt a persona (expert, educator or examiner, e.g. senior anaesthetist) leads to more disciplined and contextually appropriate responses.

Promoting reasoning and justification

Prompting the model to provide rationales or justification for its outputs enhances transparency and authenticity. By providing this additional information, educators are able to verify alignment with current clinical practice, literature and intended learning outcomes.

Structure

Finally, structured prompting using chain-of-thought techniques refers to providing the prompt in logical progression. Stepwise templates guide the model through sequential components of question generation, such as beginning with defining the clinical context, then incorporating constraints, then providing examples, in order to gradually refine the desired output.

GPT-4o prompts for ANZCA Final Examination questions

To demonstrate the impact of using structured prompt engineering on AI-assisted question generation, we applied these principles using GPT-4o to develop SAQs relevant to the ANZCA Final Examination. Each successive prompt layer added a new design element, beginning with a general clinical context (Prompt 1), followed by explicit curricular alignment (Prompt 2), structural constraints (Prompt 3), sample exam questions (Prompt 4), role assignment (Prompt 5) and finally a request for reasoning and justification (Prompt 6) (Supplemental material Table S2).

The evolution of the AI-generated content demonstrates a clear requirement for thorough prompt engineering (Table 1). Early prompts yielded longer, multi-part clinical scenarios, with additional prompting shifting the question design to a concise, dual-component question with explicit mark weightings, mirroring the structure seen in official ANZCA Final Examination reports.

Furthermore, we observed the model retained and built upon previous input with improved question generation earlier in the process when the same prompts were run with different question domains (see Supplemental material). This suggests that prompt layering not only refines individual responses but also improves subsequent content generation.

Discussion

This article highlights that prompt design is not merely a technical task but a pedagogical exercise, requiring educators to articulate context, intent, structure, role assignment and expectations with precision. While LLMs offer powerful capabilities to support assessment development, their successful integration into high-stakes contexts, such as the ANZCA Final Examination, demands ongoing validation and a system of expert review. Rather than replacing human expertise, LLMs should be viewed as tools that can enhance the efficiency, diversity and academic value of examination content. An understanding of the appropriate use of these tools and their limitations, and training educators on appropriate application, will help realise the full potential of AI-supported assessment in health professional education.

Supplemental Material

sj-docx-1-aic-10.1177_0310057X261453395 – Supplemental material for Large language model prompt engineering for medical education: A practical guide for the Australian and New Zealand College of Anaesthetists Final Examination

Supplemental material, sj-docx-1-aic-10.1177_0310057X261453395 for Large language model prompt engineering for medical education: A practical guide for the Australian and New Zealand College of Anaesthetists Final Examination by Timothy J. Trewren, Galina Gheihman, Kelly Bratkovic, D-Yin Lin, Stewart Anderson, Dario Winterton, Christina Gao and Brandon Stretton in Anaesthesia and Intensive Care

Footnotes

Author contributions

Timothy J. Trewren: Data curation; Formal analysis; Investigation; Methodology; Writing – original draft.

Galina Gheihman: Formal analysis; Writing – review & editing.

Kelly Bratkovic: Conceptualisation; Writing – review & editing.

D-Yin Lin: Conceptualization; Writing – review & editing.

Stewart Anderson: Conceptualisation; Writing – review & editing.

Dario Winterton: Formal analysis; Writing – review & editing.

Christina Gao: Methodology; Writing – review & editing.

Brandon Stretton: Data curation; Investigation; Methodology; Project administration; Supervision; Writing – original draft.

Artificial intelligence statement

GPT-4o was used to undertake the study design with prompt output as the primary project outcome. ChatGPT was used to assist with readability and clarity. The authors have reviewed and take responsibility for the work.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Ethical considerations

Ethical approval was not required for this study since it involved the interrogation of a large language model with investigator-created content.

ORCID iDs

Timothy J. Trewren

Galina Gheihman

Supplemental material

Supplemental material for this article is available online.

References

Chan

LK.

Enhancing the readability of anaesthesia-related patient education materials using artificial intelligence. Anaesth Intensive Care 2025; 53: 418–419. doi:10.1177/0310057X251344625

Gilardi

Alizadeh

Kubli

ChatGPT outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci U S A 2023; 120: e2305016120. doi:10.1073/pnas.2305016120

Artsi

Sorin

Konen

, et al. Large language models for generating medical examinations: Systematic review. BMC Med Educ 2024; 24: 354. doi:10.1186/s12909-024-05239-y

Chen

Zhang

Langrené

, et al. Unleashing the potential of prompt engineering for large language models. arXiv 2023; 2310.14735. doi:10.48550/arXiv.2310.14735

Sahoo

Singh

Saha

, et al. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024; 2402.07927. doi:10.48850/arXiv.2402.07927

Marvin

Raudha

Jjingo

, et al. Prompt engineering in large language models. In: Jacob

Piramuthu

Falkowski-Gilski

(eds) Data Intelligence and Cognitive Informatics. Singapore: Springer Nature, 2024, pp.387–402. doi:10.1007/978-99-7962-2_3

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.05 MB