Artificial intelligence in the management of chronic pain and lipedema: A comparative analysis of ChatGPT-5o, Gemini-3, and Perplexity AI in terms of readability and academic reliability

Abstract

Objectives

Lipedema is a chronic disorder characterized by pain and disproportionate fat distribution, and its diagnosis is frequently overlooked. The aim of this study was to evaluate and compare the responses generated by contemporary artificial intelligence models—ChatGPT-5o, Gemini-3, and Perplexity AI—to structured clinical questions developed in accordance with the 2024 S2k Lipedema Guideline. The models were analyzed in terms of clinical accuracy, readability, and reference reliability to assess their performance in delivering guideline-based medical information.

Methods

This cross-sectional and comparative study was conducted by submitting 30 structured clinical questions, prepared on the basis of the relevant guideline, to three large language models. Responses collected on 10 February 2026 were evaluated using a seven-point Likert scale (reliability) and a five-point scale (accuracy). Text readability was assessed using six established indices, including the Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), and Gunning Fog Index (GFOG). Reference reliability was examined by analyzing hallucination tendencies as defined in the literature.

Results

A statistically significant difference in reliability was observed among the models (p = .041); Perplexity (4.95 ± 1.20) achieved significantly higher scores than ChatGPT-5o (4.38 ± 1.05) (p = .038). In readability analyses, Perplexity (12.80 ± 2.10) required a significantly higher educational level according to FKGL scores compared to both ChatGPT-5o (p = .041) and Gemini-3 (p = .036). Regarding reference reliability, ChatGPT-5o outperformed Perplexity in source verifiability (p = .031), bibliographic precision (p = .044), and total RHS scores (p = .027), emerging as the most robust model in this domain. No statistically significant differences were found among the models in terms of clinical accuracy and usefulness (p > .05). Inter-rater agreement was excellent (Kappa: 0.92–0.97).

Conclusion

In this study, ChatGPT-5o distinguished itself in reference quality, whereas Perplexity demonstrated superior reliability. However, the complex linguistic structures accompanying efforts to maintain high medical accuracy may constitute a significant barrier for individuals with limited e-health literacy. Although these systems show strong potential as medical information resources, they cannot yet replace expert physician oversight in terms of patient safety. A balanced approach between technical reliability and patient-centered simplification remains necessary.

Keywords

artificial intelligence, ChatGPT, lipedema, online medical information, patient education, readability

Introduction

Lipedema is a chronic disorder that predominantly affects women and is characterized by pain and bilateral, symmetrical fat distribution of the extremities. Despite its distinct clinical features, it is frequently misdiagnosed. Within the pathophysiology of lipedema, pain is regarded not only as a core symptom but also as a diagnostic prerequisite. Current evidence suggests that pain may arise from localized inflammation, with persistent inflammatory activity potentially triggering fibrotic tissue changes. Clinical studies have reported tenderness to palpation in all patients (100%), while spontaneous pain has been observed with a high prevalence of 82%.^1,2

This clinical presentation—marked by pain, pressure sensitivity, and easy bruising—may overlap with obesity, lymphedema, and chronic venous disorders, leading to diagnostic delays and substantial variability in therapeutic approaches. In the management of lipedema, accurate diagnostic evaluation and the timely implementation of appropriate conservative or surgical treatment options are critical for preserving patient quality of life. In this context, evidence-based clinical guidelines serve as essential reference frameworks to standardize diagnostic and therapeutic processes and to reduce variability in clinical practice.^3–6

One of the most recent and comprehensive guidance documents addressing lipedema is the German S2k Lipedema Guideline (2024), which provides a multidisciplinary consensus on the definition, epidemiology, differential diagnosis, staging, and treatment strategies of the disease. In clinical decision-making processes, structured and evidence-based guidelines of this nature play a pivotal role in ensuring patient safety and optimizing treatment effectiveness.⁷

In recent years, rapid advancements in artificial intelligence (AI) technologies have introduced new opportunities for the evaluation and management of chronic diseases. In particular, large language models (LLMs) have attracted attention as potential clinical decision-support tools due to their capacity to synthesize extensive medical literature and generate structured, natural-language responses to clinical queries. LLM-based systems such as ChatGPT, Gemini, and Perplexity are capable of rapidly summarizing diagnostic criteria, outlining differential diagnoses, and comparing therapeutic options.^8,9

However, in conditions such as lipedema—where clinical overlap is common and multidisciplinary management is required—the guideline adherence and recommendation consistency of LLM-generated outputs are of critical importance. The provision of inaccurate or incomplete clinical information may result in errors in differential diagnosis, while ambiguous recommendation levels or guideline-inconsistent treatment statements may pose risks to patient safety. Furthermore, the generation of unverifiable or fabricated references (reference hallucination) represents a significant limitation affecting the credibility of these systems.^10,11

Given the increasing use of large language models by clinicians, a systematic evaluation of their performance in delivering guideline-based information on lipedema is warranted. Accordingly, this study aims to analyze the responses generated by ChatGPT, Gemini, and Perplexity to structured, open-ended clinical questions derived from the 2024 S2k Lipedema Guideline—encompassing definitions, diagnosis, differential diagnosis, staging, and treatment recommendations—using a multidimensional evaluation framework assessing clinical accuracy, guideline adherence, reliability, readability, and potential reference hallucination.

Materials and methods

Study design

This study was designed as a cross-sectional and comparative methodological analysis to examine how accurately, reliably, and consistently with the guideline large language models present recommendations for the diagnosis and treatment of lipedema. The investigation was based exclusively on the evaluation of written textual outputs. No patients, volunteers, or clinical datasets were involved, and no interventions were performed on humans or animals. Therefore, ethics committee approval was not required. Similar methodological approaches have been employed in previously published studies.¹² The research process was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki.

Selection and rationale of the clinical reference

The open-ended clinical questions used in this study were developed on the basis of an evidence-based and consensus-driven clinical reference document addressing the diagnosis and treatment of lipedema. For this purpose, the multidisciplinary guideline titled S2k Guideline: Lipedema (2024) was selected as the primary reference source.⁷

This guideline was chosen because it provides explicit and structured recommendations regarding the definition, diagnostic criteria, staging, differential diagnosis, and treatment options of lipedema; encompasses both conservative and surgical approaches; and represents an internationally recognized consensus document. Furthermore, its systematic and recommendation-based structure offers a suitable framework for objectively evaluating the guideline interpretation performance of large language models.⁷

Development of guideline-based open-ended questions

The S2k Lipedema guideline was reviewed in detail, and sections containing structured recommendation statements and consensus declarations were included in the study scope.

Following the methodological approach described by Liu et al.,¹³ a question pool was generated to represent the principal clinical decision domains addressed in the guideline, and 30 open-ended clinical questions were selected for final analysis.

The questions were structured to cover the following domains: definition and pathophysiology, diagnostic criteria, differential diagnosis, staging and classification, conservative treatment, and surgical treatment. The question development process was conducted by two physicians with clinical experience in lipedema management. Each question was reformulated to preserve the original clinical intent and core meaning of the guideline while ensuring clarity and direct answerability by large language models. Discrepancies were resolved through mutual discussion until consensus was achieved.

To enable standardized evaluation of both clinical content and reference accuracy, the following uniform prompt was appended to each question:

“Please first answer the question directly without specifying levels of evidence. Then, provide a reference list consisting of exactly 10 scientific sources supporting your statements. For each source, include the title, authors, journal name, publication year, DOI, and a direct link.”
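The prompt standardization described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `UNIFORM_PROMPT` and `build_query` are hypothetical, and the study used the public web interfaces rather than an API, so no such code is implied by the original workflow.

```python
# Hypothetical helper: appends the study's uniform prompt to each question.
UNIFORM_PROMPT = (
    "Please first answer the question directly without specifying levels "
    "of evidence. Then, provide a reference list consisting of exactly 10 "
    "scientific sources supporting your statements. For each source, "
    "include the title, authors, journal name, publication year, DOI, and "
    "a direct link."
)


def build_query(question: str) -> str:
    """Return the question followed by the uniform prompt."""
    return f"{question.strip()}\n\n{UNIFORM_PROMPT}"


# Two questions taken from Table 1.
questions = [
    "According to the S2k guideline, how is lipedema defined?",
    "What is the typical fat distribution pattern in lipedema?",
]
queries = [build_query(q) for q in questions]
print(queries[0])
```

Building every query from the same constant keeps the instruction text identical across questions and models, which is the property the standardized prompt is meant to guarantee.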

The complete list of open-ended clinical questions derived from the guideline and used in the study is presented in Table 1.

Table 1.

Structured questions based on the S2k Lipedema guideline.

Section	No	Question
I. Definition & core diagnostic characteristics	1	According to the S2k guideline, how is lipedema defined?
	2	What is the typical fat distribution pattern in lipedema?
	3	What is the diagnostic significance of foot sparing in lipedema?
	4	How are spontaneous pain and tenderness described in the guideline?
	5	Is lipedema characteristically bilateral and symmetrical?
	6	Is easy bruising described as a feature of lipedema?
II. Diagnostic approach	7	Is lipedema diagnosis primarily clinical, or is imaging mandatory according to the S2k guideline?
	8	Which elements should be included in the medical history?
	9	What are the key physical examination findings to be assessed?
	10	Do laboratory tests have a routine role in the diagnosis of lipedema?
	11	Is ultrasonography mandatory for diagnosis?
	12	Is MRI routinely recommended in lipedema diagnosis?
III. Differential diagnosis	13	What is the main difference between lipedema and obesity according to the guideline?
	14	How is lipedema differentiated from lymphedema?
	15	Is the Stemmer sign typically positive in lipedema?
	16	What is the key distinction between lipedema and lipohypertrophy?
	17	How is chronic venous disease distinguished from lipedema?
IV. Staging & classification	18	How are lipedema stages defined in the S2k guideline?
	19	Is staging based on skin surface morphology?
	20	Is type classification based on anatomical distribution?
	21	Should staging be considered in treatment planning?
V. Conservative management (recommendation-based)	22	Is a multimodal conservative treatment approach recommended?
	23	Is compression therapy recommended to reduce symptoms?
	24	Is complex physical decongestive therapy (CDT) routinely indicated?
	25	Is exercise recommended as part of lipedema management?
	26	Is weight management recommended in treatment?
	27	Should psychosocial support be included in the treatment plan?
VI. Surgical management	28	Is liposuction recommended as a treatment option?
	29	Should surgical treatment be considered after failure of conservative therapy?
	30	Must conservative therapy be attempted prior to surgery?

Large language models evaluated and response collection

Three large language models accessible through publicly available free web interfaces were evaluated:

ChatGPT-5o (OpenAI, San Francisco, CA, USA)

Gemini-3 (Google LLC, Mountain View, CA, USA)

Perplexity AI (Perplexity AI Inc., USA)

Premium versions, paid subscription plans, or API-based access methods were not used. This decision was made to reflect real-world user conditions and to evaluate platforms accessible across different socioeconomic groups.¹⁴

All open-ended questions were submitted during a single session period on the same day (10 February 2026), under comparable technical conditions. To ensure comparability, the identical question set and standardized prompt were used for each model. No modifications were made to the generated responses; outputs were recorded verbatim for analysis.

To minimize user-related bias, separate and independent sessions were created for each model. Prior to each session, browser history and cookies were cleared, user accounts were logged out, and all interactions were conducted in incognito mode. To prevent carryover effects, each question was presented in a separate conversation window, ensuring independent interactions.

Evaluation of model responses

The responses generated by each large language model were independently and blindly evaluated by two experienced physicians.

Reliability and practical usefulness were assessed using a previously defined seven-point Likert scale,¹⁵ where higher scores indicate greater reliability and usefulness.

Each response was additionally evaluated using a five-point accuracy scale ranging from 1 (completely incorrect content) to 5 (completely accurate and guideline-consistent content).¹⁶

In cases of scoring discrepancies between evaluators, a final decision was reached through consensus. Inter-rater agreement was analyzed using Cohen’s kappa coefficient.
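As a minimal sketch of the agreement statistic named above, the following Python function computes an unweighted Cohen's kappa for two raters. The text does not state whether an unweighted or weighted kappa was used, so the unweighted form is an assumption here, and the scores shown are hypothetical rather than study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty item set.")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1.")
    return (observed - expected) / (1 - expected)


# Hypothetical scores for five responses (not study data).
scores_a = [1, 2, 3, 3, 2]
scores_b = [1, 2, 3, 3, 1]
kappa = cohens_kappa(scores_a, scores_b)
print(round(kappa, 3))
```

Kappa corrects raw percentage agreement for the agreement two raters would reach by chance, which is why it is generally preferred over simple percentage agreement for rating scales.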

Readability and reference reliability analysis

The linguistic accessibility of LLM responses was assessed using complementary readability indices based on different mathematical formulations. The following indices were calculated: Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), Gunning Fog Index (GFOG), Simple Measure of Gobbledygook (SMOG), Automated Readability Index (ARI), and Coleman–Liau Index (CLI). These indices evaluate text comprehensibility and the approximate educational level required for content understanding using distinct parameters (Table 2).

Table 2.

Readability formulas.

Readability index	Description	Formula
Flesch Reading Ease Score (FRES)	It was developed to evaluate the readability of newspapers and is best suited for evaluating school textbooks and technical manuals. It is a standardized test used by many US government agencies. Scores range from 0 to 100, with higher scores indicating easier readability.	I = 206.835 − (84.6 × (B/W)) − (1.015 × (W/S))
Flesch–Kincaid Grade Level (FKGL)	Part of the Kincaid Navy personnel test collection. Designed for technical documentation and suitable for a wide range of disciplines.	G = (11.8 × (B/W)) + (0.39 × (W/S)) − 15.59
Simple Measure of Gobbledygook (SMOG)	It is generally suitable for secondary-age (4th grade to college level) readers. It tests for 100% comprehension, whereas most formulas test for about 50%–75% comprehension. Most accurate when applied to documents ≥30 sentences long.	G = 1.0430 × √C + 3.1291
Gunning Fog Index (GFOG)	It was developed to help American businesses improve the readability of their writing. Applicable to many disciplines.	G = 0.4 × ((W/S) + ((C/W) × 100))
Coleman–Liau Index (CLI)	It is designed for secondary-age (4th grade to college level) readers. The formula is based on text in the grade level range of 0.4 to 16.3. It applies to many industries.	G = (−27.4004 × (E/100)) + 23.06395
Automated Readability Index (ARI)	ARI has been used by the military in writing technical manuals, and its calculation returns the grade level necessary for understanding.	ARI = (4.71 × l) + (0.5 × ASL) − 21.43

W: number of words; S: number of sentences; B: number of syllables; C: number of polysyllabic words (≥3 syllables); l: average number of letters per word; ASL: average sentence length (number of words); E: average number of letters per 100 words.
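To make the formulas in Table 2 concrete, the sketch below implements five of them in Python directly from the symbols defined in the legend. It is a minimal illustration under stated assumptions: the function names are hypothetical, the counts are supplied by the caller rather than extracted from text, the SMOG function uses the simplified form shown in Table 2 (which presumes C is counted over a 30-sentence sample), and the Coleman–Liau index is left out for brevity. The study itself used online calculators, not this code.

```python
import math


def fres(words, sentences, syllables):
    """Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences)


def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level (US school grade)."""
    return 11.8 * (syllables / words) + 0.39 * (words / sentences) - 15.59


def gfog(words, sentences, polysyllabic):
    """Gunning Fog Index; polysyllabic = words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (polysyllabic / words))


def smog(polysyllabic):
    """SMOG grade in the simplified form given in Table 2."""
    return 1.0430 * math.sqrt(polysyllabic) + 3.1291


def ari(letters, words, sentences):
    """Automated Readability Index from letter, word, and sentence counts."""
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical counts for one response (not study data).
W, S, B, C, LETTERS = 100, 5, 150, 16, 500
print(round(fres(W, S, B), 2), round(fkgl(W, S, B), 2))
print(round(gfog(W, S, C), 2), round(smog(C), 2), round(ari(LETTERS, W, S), 2))
```

Because all of these indices depend only on counts of letters, syllables, words, and sentences, differences between tools usually come from how each tool counts syllables and splits sentences rather than from the formulas themselves.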

Calculations were independently performed using two online readability tools (readabilityformulas.com and onlineutility.org). For each response, values obtained from both platforms were recorded separately, and final model-level scores were derived by averaging the results from the two sources. Findings were presented as median (minimum–maximum) for each model.⁸
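The averaging and summary steps described above can be sketched as follows. This is a hedged illustration: `average_tools` and `summarize` are hypothetical names, the values are invented for demonstration, and the exact aggregation procedure used in the study may differ in detail.

```python
from statistics import median


def average_tools(tool_a, tool_b):
    """Average per-response index values reported by two readability tools."""
    if len(tool_a) != len(tool_b):
        raise ValueError("Both tools must report a value for every response.")
    return [(a + b) / 2 for a, b in zip(tool_a, tool_b)]


def summarize(values):
    """Return (median, minimum, maximum) for one model's averaged scores."""
    return median(values), min(values), max(values)


# Hypothetical FKGL values for three responses from one model (not study data).
tool_a = [10.0, 12.0, 14.0]
tool_b = [11.0, 12.0, 15.0]
averaged = average_tools(tool_a, tool_b)
med, lo, hi = summarize(averaged)
print(f"{med:.2f} ({lo:.2f}-{hi:.2f})")
```

Averaging per response before summarizing per model keeps the two tools equally weighted for every answer.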

The accuracy and reliability of references generated by the LLMs were evaluated using a framework previously defined in the literature. Based on the approach proposed by Aljamaan et al.,¹⁰ the following criteria were assessed: consistency of publication year, journal validity, title accuracy, author information, topical relevance, link functionality, and accessibility. Higher scores indicated a greater tendency toward reference hallucination.

Statistical analysis

Statistical analyses were performed using SPSS for Windows version 24.0 (SPSS Inc., Chicago, IL, USA). Numerical data were presented as mean ± standard deviation and median (minimum–maximum). Inter-model comparisons were conducted using paired-samples t-tests or Wilcoxon signed-rank tests, depending on data distribution. Inter-rater agreement was analyzed using Cohen’s kappa coefficient. A p-value <.05 was considered statistically significant.

Results

Following evaluation of the responses generated by ChatGPT-5o, Gemini-3, and Perplexity to the 30 open-ended clinical questions derived from the S2k Lipedema Guideline, statistically significant inter-model differences were identified across selected outcome measures.

Descriptive statistics related to clinical performance and content alignment are presented in Table 3. A statistically significant difference was observed among the models in terms of reliability (p = .041). In Bonferroni-adjusted pairwise comparisons, Perplexity (4.95 ± 1.20) demonstrated significantly higher reliability scores than ChatGPT-5o (4.38 ± 1.05) (p = .038). The differences between Gemini-3 (4.62 ± 0.88) and ChatGPT-5o (p = .214), as well as between Gemini-3 and Perplexity (p = .172), were not statistically significant. No statistically significant differences were identified among the models with respect to usefulness (p = .058) or clinical accuracy (p = .074).
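For readers unfamiliar with the Bonferroni adjustment mentioned above, the following sketch shows the arithmetic for three pairwise comparisons. The raw p-values are hypothetical and chosen only for illustration; the manuscript reports adjusted values and does not describe the software procedure, so this is not a reconstruction of the authors' analysis.

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for the three model pairs (not study data).
raw = {
    "ChatGPT-5o vs Gemini-3": 0.20,
    "ChatGPT-5o vs Perplexity": 0.01,
    "Gemini-3 vs Perplexity": 0.50,
}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(adjusted)
print(significant)
```

With three models there are three pairwise comparisons, so each raw p-value is multiplied by three before being compared with the .05 threshold.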

Table 3.

Comparison of clinical performance metrics.

Variable	ChatGPT-5o	Gemini-3	Perplexity	p
Reliability (1–7)	4.38 ± 1.05^a	4.62 ± 0.88^a	4.95 ± 1.20^a	0.041
Usefulness (1–7)	4.22 ± 0.95	4.48 ± 0.82	4.72 ± 1.10	0.058
Clinical accuracy (1–5)	3.62 ± 0.72	3.81 ± 0.63	3.90 ± 0.78	0.074

Values are presented as mean ± SD (range).

Post-hoc analyses were performed only when the omnibus test was statistically significant. Bold formatting has been used to indicate statistically significant values.

^aDifferent superscript letters indicate statistically significant differences between groups according to Bonferroni-adjusted post-hoc comparisons (p < .05).

Readability analysis results are summarized in Table 4. No statistically significant inter-model differences were detected for FRES (p = .118), GFOG (p = .067), CLI (p = .176), SMOG (p = .061), or ARI (p = .052). However, a significant difference was observed for FKGL scores (p = .048). Post-hoc analyses revealed that Perplexity (12.80 ± 2.10) required a significantly higher educational grade level compared with both ChatGPT-5o (12.05 ± 1.85; p = .041) and Gemini-3 (11.45 ± 1.60; p = .036). The difference between ChatGPT-5o and Gemini-3 was not statistically significant (p = .302).

Table 4.

Comparison of readability metrics.

Variable	ChatGPT-5o	Gemini-3	Perplexity	p
FRES	35.40 ± 10.80 (8.20–56.10)	33.90 ± 9.50 (12.30–49.60)	29.10 ± 11.20 (7.90–53.40)	0.118
GFOG	14.90 ± 2.60 (10.80–19.80)	14.10 ± 2.10 (11.20–17.60)	15.40 ± 2.70 (10.90–20.40)	0.067
FKGL	12.05 ± 1.85^a (8.70–16.20)	11.45 ± 1.60^a (9.40–14.90)	12.80 ± 2.10^a (8.90–16.50)	0.048
CLI	15.55 ± 2.30 (11.10–19.60)	16.05 ± 1.95 (13.70–18.80)	16.30 ± 2.40 (12.20–19.90)	0.176
SMOG	10.45 ± 1.95 (8.10–14.20)	9.75 ± 1.55 (7.90–12.00)	10.85 ± 2.10 (8.20–13.80)	0.061
ARI	13.20 ± 2.45 (9.00–18.60)	12.15 ± 1.90 (10.10–15.40)	13.85 ± 2.30 (9.20–17.10)	0.052

Values are presented as mean ± SD (range). Bold formatting has been used to indicate statistically significant values.

FRES: Flesch Reading Ease Score; GFOG: Gunning Fog Index; FKGL: Flesch–Kincaid Grade Level; CLI: Coleman–Liau Index; SMOG: Simple Measure of Gobbledygook; ARI: Automated Readability Index.

^aDifferent superscript letters indicate statistically significant differences between groups according to Bonferroni-adjusted post-hoc comparisons (p < .05).

Reference reliability findings are presented in Table 5. Statistically significant differences were observed among the models in Presence/Verifiability (p = .032), Bibliographic Accuracy (p = .047), and total RHS score (p = .029). In the Presence/Verifiability subcomponent, the difference between ChatGPT-5o (1.88 ± 0.92) and Perplexity (1.18 ± 1.05) was statistically significant (p = .031), whereas comparisons between ChatGPT-5o and Gemini-3 (1.52 ± 0.87; p = .118) and between Gemini-3 and Perplexity (p = .094) were not significant.

Table 5.

Comparison of reference hallucination score components.

Variable	ChatGPT-5o	Gemini-3	Perplexity	p
Presence/Verifiability (0–3)	1.88 ± 0.92^a	1.52 ± 0.87^a	1.18 ± 1.05^a	0.032
Bibliographic accuracy (0–3)	2.02 ± 0.90^a	1.68 ± 0.95^a	1.25 ± 1.10^a	0.047
PubMed identifier accuracy (0–2)	1.36 ± 0.68	1.14 ± 0.70	0.92 ± 0.78	0.083
Relevance to topic (0–3)	1.94 ± 0.88	1.60 ± 0.92	1.28 ± 1.05	0.051
Total RHS (0–11)	7.20 ± 3.40^a	5.94 ± 3.10^a	4.63 ± 3.60^a	0.029

Values are presented as mean ± SD (range).

RHS: Reference Hallucination Score. Bold formatting has been used to indicate statistically significant values.

^aDifferent superscript letters indicate statistically significant differences between groups according to Bonferroni-adjusted post-hoc comparisons (p < .05). Post-hoc analyses were conducted only when the omnibus test indicated statistical significance.
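The component ranges in Table 5 (0–3, 0–3, 0–2, and 0–3) add up to the 0–11 range of the total score, and the reported component means sum to the reported total means. Assuming the total RHS is a plain sum of its four components, the scoring can be sketched as follows; the dictionary keys and the function name `total_rhs` are hypothetical.

```python
# Component maxima taken from Table 5. Treating the total as a plain sum of
# the components is an assumption consistent with the 0-11 range.
RHS_MAX = {
    "presence_verifiability": 3,
    "bibliographic_accuracy": 3,
    "pubmed_identifier_accuracy": 2,
    "relevance_to_topic": 3,
}


def total_rhs(components):
    """Validate each component against its range and return the sum."""
    for name, max_score in RHS_MAX.items():
        value = components[name]
        if not 0 <= value <= max_score:
            raise ValueError(f"{name} must lie between 0 and {max_score}.")
    return sum(components[name] for name in RHS_MAX)


# Mean component scores reported for ChatGPT-5o in Table 5.
chatgpt_means = {
    "presence_verifiability": 1.88,
    "bibliographic_accuracy": 2.02,
    "pubmed_identifier_accuracy": 1.36,
    "relevance_to_topic": 1.94,
}
print(round(total_rhs(chatgpt_means), 2))
```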

Regarding Bibliographic Accuracy, ChatGPT-5o (2.02 ± 0.90) outperformed Perplexity (1.25 ± 1.10), with the difference reaching statistical significance (p = .044). However, comparisons between ChatGPT-5o and Gemini-3 (p = .082), as well as between Gemini-3 and Perplexity (p = .109), did not demonstrate statistical significance.

For the total RHS score, the difference between ChatGPT-5o (7.20 ± 3.40) and Perplexity (4.63 ± 3.60) was statistically significant (p = .027). In contrast, differences between ChatGPT-5o and Gemini-3 (5.94 ± 3.10; p = .082) and between Gemini-3 and Perplexity (p = .119) were not statistically significant. No significant inter-model differences were identified for PubMed Identifier (PMID) Accuracy (p = .083) or Relevance to Topic (p = .051).

Inter-rater agreement analysis demonstrated excellent consistency across all evaluated domains. Cohen’s kappa coefficients were calculated as 0.92 for clinical accuracy, 0.95 for completeness, 0.97 for absence of incorrect or missing information, and 0.96 for consistency, indicating near-perfect agreement among evaluators (all p < .001).

Discussion

This study provides a comparative evaluation of contemporary large language models in delivering guideline-based information regarding lipedema, a chronic and frequently misunderstood disorder characterized by pain, tenderness, disproportionate adipose tissue accumulation, and diagnostic complexity. The findings revealed notable inter-model differences in reliability, reference quality, and readability characteristics. Perplexity AI achieved significantly higher reliability scores, whereas ChatGPT-5o demonstrated superior performance in source verifiability, bibliographic precision, and overall reference reliability. Although most readability indices were comparable among the models, all systems generally required relatively high educational levels for comprehension.

An important contextual issue underlying these findings is the limited strength of the existing evidence base in lipedema research itself. Although lipedema has increasingly attracted scientific attention in recent years, the current literature still contains relatively few high-quality original investigations, while opinion papers, narrative reviews, expert consensus documents, and patient-generated online content remain disproportionately abundant. The German S2k Lipedema Guideline, which served as the reference framework for this study, also emphasizes the limited and heterogeneous nature of the available evidence.⁷ Consequently, large language models trained on extensive internet-based datasets may reflect this imbalance within the underlying information environment, in which scientifically robust evidence is relatively scarce compared with a large volume of lay and opinion-based content. Within this context, differences in response characteristics across models may be interpreted as a potential reflection of the balance between scientific caution and linguistic accessibility, rather than a directly measurable causal relationship. Accordingly, more readable outputs may involve simplification of complex or uncertain medical concepts, whereas more academically structured responses may adopt a more cautious and formal language style.

The findings of the present study are generally consistent with previous investigations evaluating online health information and AI-generated medical content in lipedema and related vascular or lymphatic disorders. Earlier studies examining websites, social media platforms, and YouTube videos concerning lipedema and lymphedema have consistently reported deficiencies in readability, transparency, reliability, and scientific quality.^17–22 Similarly, investigations evaluating earlier generations of AI systems in vascular medicine demonstrated variability in guideline adherence and information quality.^23,24 More recent studies in vascular and lymphatic diseases have also shown that, despite improvements in overall informational quality, AI-generated medical content may still present important limitations regarding guideline concordance, readability, consistency, and patient comprehension.^25–29 Our findings extend this literature by suggesting that newer-generation models may provide more clinically aligned and academically structured responses; however, important limitations related to accessibility and patient comprehension persist.

These findings are particularly relevant in the context of e-health literacy. Patients increasingly rely on internet-based resources and conversational AI systems to better understand chronic conditions before or after professional consultations.^29,30 In diseases such as lipedema, where delayed diagnosis and misinformation are common, AI systems may offer important advantages by rapidly synthesizing large volumes of medical information into interactive responses.^31,32 Nevertheless, if generated content exceeds the average patient’s reading capacity, the practical benefit of technically accurate information may remain limited. The inverse relationship observed between medical reliability and readability therefore represents not only a linguistic issue, but also a potential equity concern affecting access to understandable healthcare information.

Several limitations should be considered when interpreting the findings of this study. First, the analysis was restricted to 30 structured questions derived from the S2k Lipedema Guideline and did not include patient-generated questions obtained from social media or online communities. Second, the evaluation was conducted exclusively in English and may not reflect language-specific variations in AI performance. Third, only three contemporary AI models were assessed, and the rapidly evolving nature of these systems limits the long-term generalizability of the findings. Finally, because AI platforms undergo frequent algorithmic updates, performance characteristics may vary over time.

Despite these limitations, this study has several important strengths. To our knowledge, this is the first investigation specifically evaluating large language models in the context of lipedema management. In addition to readability, the analysis incorporated multidimensional assessment domains including clinical accuracy, reliability, bibliographic precision, reference verifiability, and hallucination-related tendencies. The simultaneous evaluation of multiple contemporary AI systems using standardized guideline-based questions provides clinically relevant insights regarding the current capabilities and limitations of conversational AI technologies in chronic disease education.

Conclusion

Contemporary large language models demonstrated substantial potential for delivering guideline-based information regarding lipedema. However, an important trade-off persists between scientific rigor and patient-level readability. While academically stronger responses tended to provide more reliable and verifiable information, they also required higher educational levels for comprehension. Given the limited evidence base in lipedema research and the widespread presence of non-scientific online content, AI-generated medical information should still be interpreted cautiously and under professional supervision. Future developments in conversational artificial intelligence should aim not only to improve technical accuracy and reference reliability, but also to optimize accessibility for individuals with varying levels of e-health literacy.

Footnotes

ORCID iD

İlhan Celil Özbek

Ethical considerations

The investigation was based exclusively on the evaluation of written textual outputs. No patients, volunteers, or clinical datasets were involved, and no interventions were performed on humans or animals. Therefore, ethics committee approval was not required. The research process was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki.

Consent to participate

Written informed consent was obtained from all individual participants included in the study.

Author contributions

All authors contributed to the conception and design of the study. Material preparation, data collection, and analysis were performed by all authors. All authors read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Declaration of generative AI use in writing

During the preparation of this manuscript, the authors used OpenAI’s ChatGPT (version 5-turbo) to assist in refining the English language, including improvements in clarity, coherence, and academic tone during the revision process. All content, data interpretation, and scientific conclusions were generated, critically reviewed, verified, and approved by the authors, who take full responsibility for the integrity and originality of the published work.

Guarantor

Dr İlhan Celil Özbek is the guarantor of this work and takes full responsibility for the integrity of the data and the accuracy of the data analysis.
