Onur Gultekin1, Michael T. Hirschmann2,3, Halil İbrahim Arıkan4, Bekir Eray Kilinc1, Baris Yilmaz1, Süleyman Abul5, Jumpei Inoue6, Mahmut Enes Kayaalp1,7,8

1Department of Orthopedics and Traumatology, University of Health Sciences, İstanbul Fatih Sultan Mehmet Training and Research Hospital, İstanbul, Türkiye
2Department of Orthopedic Surgery and Traumatology, Kantonsspital Baselland, Bruderholz, Switzerland
3Department of Clinical Research, Research Group Michael T. Hirschmann, Regenerative Medicine & Biomechanics, University of Basel, Basel, Switzerland
4Department of Orthopedics and Traumatology, Viranşehir State Hospital, Şanlıurfa, Türkiye
5Department of Orthopedics and Traumatology, İstanbul Kartal Dr. Lütfi Kırdar City Hospital, İstanbul, Türkiye
6Department of Orthopaedic Surgery, Nagoya Tokushukai General Hospital, Kasugai, Aichi, Japan
7Department of Orthopaedics and Traumatology, University Hospital Brandenburg/Havel, Brandenburg Medical School Theodor Fontane, Brandenburg/Havel, Germany
8Brandenburg Medical School Theodor Fontane, Faculty of Health Sciences Brandenburg, Brandenburg/Havel, Germany

Keywords: Agentic artificial intelligence, large language model, readability, retrieval augmented generation, tool augmentation, total knee arthroplasty.

Abstract

Objectives: This study aims to directly compare ChatGPT and DeepSeek, both equipped with DeepResearch/DeepThink capabilities, based on their responses to frequently asked questions (FAQs) on total knee arthroplasty (TKA).

Materials and methods: Thirty FAQs related to TKA were compiled from validated patient education sources, including American Academy of Orthopaedic Surgeons (AAOS) OrthoInfo, National Institute for Health and Care Excellence (NICE) guidelines, and popular patient discussion forums, and were verified for clinical relevance by two independent arthroplasty surgeons. Two orthopedic surgeons, blinded to model identity, evaluated each response using a five-point Likert scale across five domains: accuracy, comprehensiveness, readability, relevance, and ethical and safety considerations. The maximum total score per response was 25. Readability was also assessed using the Flesch-Kincaid Grade Level (FKGL) and the Flesch Reading Ease Score (FRES). Inter-rater and intra-rater reliability were calculated using intraclass correlation coefficients (ICCs).

Results: ChatGPT-4o scored significantly higher in comprehensiveness and clinical detail, whereas DeepSeek R1 produced responses with superior readability, indicated by a lower FKGL (7.5 vs. 10.2) and a higher FRES (62.3 vs. 45.6) (p < 0.05). Both models demonstrated high accuracy and safety, with no factual errors identified. Intra-rater reliability was excellent (ICC > 0.81), and inter-rater agreement ranged from fair to substantial (ICC 0.31 to 0.63).

Conclusion: Both ChatGPT-4o and DeepSeek R1 are capable of generating accurate, ethically sound, and clinically relevant educational content for patients undergoing TKA. While ChatGPT-4o offers more comprehensive information, DeepSeek R1 provides content that is more accessible to patients with lower health literacy. Model selection should be tailored to the target population to optimize educational effectiveness in clinical practice. The ability of real-time data retrieval to incorporate the most current clinical evidence and guideline updates may further enhance the educational quality, reliability, and clinical relevance of AI-generated patient information.

Citation: Gultekin O, Hirschmann MT, Arıkan Hİ, Kilinc BE, Yilmaz B, Abul S, et al. Evaluating DeepResearch and DeepThink in total knee arthroplasty patient education: ChatGPT-4o excels in comprehensiveness, DeepSeek R1 leads in clarity and readability of orthopedic information. Jt Dis Relat Surg 2026;37(2):470-476. doi: 10.52312/jdrs.2026.2645.