Onur Gultekin1, Michael T. Hirschmann2,3, Halil İbrahim Arıkan4, Bekir Eray Kilinc1, Baris Yilmaz1, Süleyman Abul5, Jumpei Inoue6, Mahmut Enes Kayaalp1,7,8

1Department of Orthopedics and Traumatology, University of Health Sciences, İstanbul Fatih Sultan Mehmet Training and Research Hospital, İstanbul, Türkiye
2Department of Orthopedic Surgery and Traumatology, Kantonsspital Baselland, Bruderholz, Switzerland
3Department of Clinical Research, Research Group Michael T. Hirschmann, Regenerative Medicine & Biomechanics, University of Basel, Basel, Switzerland
4Department of Orthopedics and Traumatology, Viranşehir State Hospital, Şanlıurfa, Türkiye
5Department of Orthopedics and Traumatology, İstanbul Kartal Dr. Lütfi Kırdar City Hospital, İstanbul, Türkiye
6Department of Orthopaedic Surgery, Nagoya Tokushukai General Hospital, Kasugai, Aichi, Japan
7Department of Orthopaedics and Traumatology, University Hospital Brandenburg/Havel, Brandenburg Medical School Theodor Fontane, Brandenburg/Havel, Germany
8Brandenburg Medical School Theodor Fontane, Faculty of Health Sciences Brandenburg, Brandenburg/Havel, Germany

Keywords: Agentic artificial intelligence, large language model, readability, retrieval augmented generation, tool augmentation, total knee arthroplasty.

Abstract

Objectives: This study aims to directly compare ChatGPT and DeepSeek, both equipped with DeepResearch/DeepThink capabilities, based on their responses to frequently asked questions (FAQs) on total knee arthroplasty (TKA).

Materials and methods: Thirty FAQs related to TKA were compiled from validated patient education sources, including American Academy of Orthopaedic Surgeons (AAOS) OrthoInfo, National Institute for Health and Care Excellence (NICE) guidelines, and popular patient discussion forums, and were verified for clinical relevance by two independent arthroplasty surgeons. Two orthopedic surgeons, blinded to model identity, evaluated each response using a five-point Likert scale across five domains: accuracy, comprehensiveness, readability, relevance, and ethical and safety considerations. The maximum total score per response was 25. Readability was also assessed using the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES). Inter-rater and intra-rater reliability were calculated using intraclass correlation coefficients (ICCs).

Results: ChatGPT-4o scored significantly higher in comprehensiveness and clinical detail, whereas DeepSeek R1 produced responses with superior readability, indicated by a lower FKGL (7.5 vs. 10.2) and a higher FRES (62.3 vs. 45.6) (p < 0.05). Both models demonstrated high accuracy and safety, with no factual errors identified. Intra-rater reliability was excellent (ICC > 0.81), and inter-rater agreement ranged from fair to substantial (ICC 0.31 to 0.63).

Conclusion: Both ChatGPT-4o and DeepSeek R1 are capable of generating accurate, ethically sound, and clinically relevant educational content for patients undergoing TKA. While ChatGPT-4o offers more comprehensive information, DeepSeek R1 provides content that is more accessible to patients with lower health literacy. Model selection should be tailored to the target population to optimize educational effectiveness in clinical practice. The ability of real-time data retrieval to incorporate the most current clinical evidence and guideline updates may further enhance the educational quality, reliability, and clinical relevance of AI-generated patient information.

Citation: Gultekin O, Hirschmann MT, Arıkan Hİ, Kilinc BE, Yilmaz B, Abul S, et al. Evaluating DeepResearch and DeepThink in total knee arthroplasty patient education: ChatGPT-4o excels in comprehensiveness, DeepSeek R1 leads in clarity and readability of orthopedic information. Jt Dis Relat Surg 2026;37(2):470-476. doi: 10.52312/jdrs.2026.2645.