Abstract
Background. Patients increasingly use the Internet and artificial intelligence (AI) platforms such as ChatGPT for medical information, raising concerns about the accuracy and clinical depth of AI-generated content. This study evaluated the reliability and clinical utility of ChatGPT (GPT-3.5 and GPT-4.0) for common foot and ankle conditions compared with patient education materials from the American Orthopaedic Foot & Ankle Society (AOFAS) FootCareMD.
Methods. Between January 20 and 26, 2025, standardized prompts were used to query GPT-3.5 and GPT-4.0 across 15 common foot and ankle conditions. ChatGPT responses were compared with AOFAS FootCareMD content based on the number of symptoms, risk factors, and treatment options provided. Two fellowship-trained foot and ankle orthopaedic surgeons independently evaluated response accuracy, categorizing outputs as <50%, 50% to 74%, 75% to 99%, or 100% accurate. Paired t-tests were used for statistical comparisons, and inter-rater reliability was assessed using Cohen’s weighted kappa.
Results. GPT-4.0 generated significantly more symptoms than AOFAS content (P = .015). In contrast, GPT-3.5 listed significantly fewer treatment options than both AOFAS and GPT-4.0 (P = .042). When addressing surgical management, both ChatGPT versions frequently provided vague or incomplete information. GPT-3.5 referenced surgery without procedural detail in 53% of responses, while GPT-4.0 lacked detailed surgical explanations or omitted them entirely in 80% of responses. Overall accuracy ratings were high, with 77% of responses judged as 75% to 99% accurate and only 3.4% rated below 50% accuracy. However, inter-rater agreement between the surgeons was poor for responses labeled as 100% accurate (κ = −0.02), highlighting subjectivity in grading AI-generated medical content.
Conclusion. ChatGPT effectively provides general information on the causes and symptoms of foot and ankle conditions, and GPT-4.0 offers more comprehensive treatment discussions than GPT-3.5. Nevertheless, its limited depth and specificity regarding surgical options restrict its clinical usefulness. Until further improvements are made, AI-generated content should serve as a supplement rather than a replacement for expert-reviewed patient education resources.
Level of Evidence: Level III, case-control study
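For readers unfamiliar with the agreement statistic named in the Methods, the following is a minimal sketch of Cohen's weighted kappa over the four ordinal accuracy categories. It assumes linear disagreement weights; the abstract does not state which weighting scheme the authors used. The function name `weighted_kappa`, the category labels, and the ratings in the usage lines are hypothetical illustrations, not the study's code or data.

```python
# Category labels are illustrative stand-ins for the four accuracy bands.
CATEGORIES = ["<50%", "50-74%", "75-99%", "100%"]


def weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Cohen's weighted kappa with linear weights (an assumed weighting scheme)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")

    k = len(categories)
    index = {label: i for i, label in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions: observed[i][j] is the share of items that
    # rater A placed in category i and rater B placed in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    # Linear weight: 0 on the diagonal, 1 for the two most distant categories.
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings: the raters agree on 8 of 10 items, yet kappa is negative
# because nearly all ratings fall into one category.
surgeon_1 = ["75-99%"] * 9 + ["100%"]
surgeon_2 = ["100%"] + ["75-99%"] * 9
print(round(weighted_kappa(surgeon_1, surgeon_2), 3))
```

In this made-up example the raw agreement is 80%, but kappa comes out slightly below zero because chance agreement is already very high when ratings cluster in one category. Whether a similar effect contributed to the κ reported above cannot be determined from the abstract alone.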
