Assessment of large language models in musculoskeletal radiological anatomy: A comparative study with radiologists

doi:10.52312/jdrs.2026.2436

Ali Salbas¹, Ebru Kul Baysan²

¹Department of Radiology, Atatürk Training and Research Hospital, İzmir, Türkiye
²Department of Physical and Rehabilitation Medicine, Atatürk Training and Research Hospital, İzmir, Türkiye

Keywords: Artificial intelligence, large language models, medical education, musculoskeletal radiology, radiological anatomy.

Abstract

Objectives: This study aims to evaluate the diagnostic performance of large language models (LLMs) in musculoskeletal radiological anatomy and to compare their accuracy with that of radiologists of varying experience levels.

Patients and methods: Between May 16, 2025 and June 12, 2025, a total of 175 multiple-choice questions (82 image-based, 93 text-only) were retrieved from Radiopaedia’s open-access database. Questions were classified by anatomical region and imaging modality. Three LLMs, ChatGPT-4o (OpenAI), Claude 3.7 Sonnet (Anthropic), and Grok 3 (xAI), were assessed in a zero-shot setting. Their responses were compared to those of an attending musculoskeletal radiologist and two residents (senior and junior). Accuracy rates were calculated and statistically compared.
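
The abstract does not state which statistical test was used for the comparison. As a minimal sketch only, the following Python shows one common way to compute accuracy and compare two respondents who answered the same questions, using an exact McNemar test on the discordant pairs. The function names and the toy answer vectors are hypothetical and are not taken from the study data.

```python
from math import comb


def accuracy(correct_flags):
    """Fraction of questions answered correctly (flags are 1/0 or True/False)."""
    return sum(correct_flags) / len(correct_flags)


def mcnemar_exact(a_flags, b_flags):
    """Two-sided exact McNemar p-value for two respondents on the same questions.

    Assumption: a paired test is appropriate because both respondents
    answered an identical question set. The study's actual test may differ.
    """
    # Count discordant pairs: one respondent correct, the other incorrect.
    a_only = sum(1 for x, y in zip(a_flags, b_flags) if x and not y)
    b_only = sum(1 for x, y in zip(a_flags, b_flags) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreement, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: 1 = correct, 0 = incorrect, on eight shared questions.
radiologist = [1, 1, 1, 1, 0, 1, 1, 1]
model = [1, 0, 0, 1, 0, 0, 1, 0]

print(f"radiologist accuracy: {accuracy(radiologist):.3f}")
print(f"model accuracy:       {accuracy(model):.3f}")
print(f"exact McNemar p:      {mcnemar_exact(radiologist, model):.3f}")
```

A paired test is used in this sketch because every respondent answered the same questions; an unpaired comparison of proportions would ignore that pairing.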

Results: The attending radiologist achieved the highest overall accuracy (79.4%), followed by the senior resident (72.6%) and the junior resident (66.9%). Among LLMs, ChatGPT-4o performed best overall (69.7%), particularly in text-based questions (88.2%). All LLMs outperformed radiologists in text-based questions but underperformed in image-based ones. The attending radiologist significantly outperformed all LLMs in image interpretation (p<0.001). Variations in performance were also noted across anatomical regions and imaging modalities, with some LLMs exceeding radiologists in specific domains such as spinal or shoulder anatomy.

Conclusion: While LLMs, particularly ChatGPT-4o, show strong performance in text-based anatomical questions, their accuracy in image-based musculoskeletal radiology remains limited compared to human radiologists. These findings suggest that LLMs can serve as supplementary tools in education but require further optimization, particularly for visual interpretation tasks, before clinical implementation.

Citation: Salbas A, Kul Baysan E. Assessment of large language models in musculoskeletal radiological anatomy: A comparative study with radiologists. Jt Dis Relat Surg 2026;37(1):190-199. doi: 10.52312/jdrs.2026.2436.

Author Contributions

Contributed to study conception and design, data collection, analysis, literature review, drafting of the manuscript, and preparation of references: A.S.; Contributed to study design, data collection, supervision, analysis, preparation of references, and critical revision of the manuscript: E.K.B. Both authors reviewed and approved the final version of the manuscript.

Conflict of Interest

The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.

Financial Disclosure

The authors received no financial support for the research and/or authorship of this article.

Acknowledgments

We sincerely thank Murat Yoğurtcu, MD (attending radiologist), Berçim Sarı Bostancı, MD (senior radiology resident), and Yaren Yazar, MD (junior radiology resident), for their valuable contributions as independent evaluators in this study. Their efforts in carefully and accurately answering musculoskeletal radiological anatomy questions were essential to the comparative analysis.

Data Sharing Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.