DOI: 10.1055/a-2807-4256
Measuring the Accuracy and Reproducibility of DeepSeek R1, Claude 3.5 Sonnet, and GPT-4.1 on Complex Clinical Scenarios
Abstract
Background
The integration of large language models (LLMs) into clinical diagnostics presents significant challenges regarding their accuracy and reliability.
Objectives
This study aimed to evaluate the performance of DeepSeek R1, an open-source reasoning model, alongside two other LLMs, GPT-4.1 and Claude 3.5 Sonnet, across multiple-choice clinical cases.
Methods
A dataset of complex medical cases representative of real-world clinical practice was selected.
For efficiency, models were accessed via application programming interfaces (APIs) and assessed using standardized prompts and a predefined evaluation protocol.
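As a rough illustration of this kind of API-based querying with a standardized prompt, the minimal Python sketch below uses the OpenAI Python SDK with an assumed prompt template and deterministic settings. It is not the study's actual notebook (which is available in the repository listed under Data Availability); the prompt wording, parameters, and model identifier are illustrative assumptions only.

```python
# Minimal illustrative sketch of API-based querying with a standardized
# prompt. This is NOT the study's actual notebook (see Data Availability);
# the prompt wording, model name, and settings here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are answering a multiple-choice clinical question.\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Reply with the single letter of the best answer."
)

def ask_model(question: str, options: str, model: str = "gpt-4.1") -> str:
    """Send one standardized prompt and return the model's raw answer text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic settings make repeated runs comparable
        messages=[
            {"role": "user",
             "content": PROMPT_TEMPLATE.format(question=question, options=options)},
        ],
    )
    return response.choices[0].message.content.strip()
```

Claude 3.5 Sonnet and DeepSeek R1 would be queried in the same pattern through their own SDKs or OpenAI-compatible endpoints, with the parsed answer letters then scored against the benchmark key.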
Results
The models demonstrated an overall accuracy of 77.1%, with GPT-4.1 producing the fewest errors and Claude 3.5 Sonnet the most. The reproducibility analysis showed high run-to-run consistency: DeepSeek R1 (100%), GPT-4.1 (97.5%), and Claude 3.5 Sonnet (92%).
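For readers unfamiliar with these metrics, the toy Python sketch below shows one way accuracy and run-to-run agreement can be computed; the answer lists are invented for illustration and do not reproduce the study's data or analysis.

```python
# Toy illustration of how accuracy and run-to-run reproducibility can be
# computed; the answers below are invented and do not reproduce the study.
def accuracy(predictions, answer_key):
    """Fraction of items answered correctly."""
    return sum(p == a for p, a in zip(predictions, answer_key)) / len(answer_key)

def agreement(run1, run2):
    """Fraction of items on which two repeated runs gave the same answer."""
    return sum(x == y for x, y in zip(run1, run2)) / len(run1)

key  = ["B", "D", "A", "C", "E"]   # benchmark answer key
run1 = ["B", "D", "A", "B", "E"]   # hypothetical first pass (4/5 correct)
run2 = ["B", "D", "A", "B", "E"]   # identical repeat pass

print(f"accuracy:  {accuracy(run1, key):.1%}")    # 80.0%
print(f"agreement: {agreement(run1, run2):.1%}")  # 100.0%
```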
Conclusion
While LLMs show promise for enhancing diagnostics, ongoing scrutiny is required to address error rates and to validate gold-standard medical answers. Given the limited dataset and prompting protocol, these findings should not be interpreted as evidence of broader equivalence in real-world clinical reasoning. The study underscores the need for robust evaluation standards and further research.
Protection of Human and Animal Subjects
No human subjects were involved in our research. For that reason, our study was not submitted to an Institutional Review Board (IRB).
Declaration of GenAI Use
During the writing process of this paper, the author(s) used QuillBot in order to improve spelling and syntax. The author(s) reviewed and edited the text and take(s) full responsibility for the content of the paper.
Data Availability Statement
The Python notebook, dataset, and readme files are available on GitHub in the MMLU-Pro-Project repository ( https://github.com/rehoyt/MMLU-Pro-Project.git ).
Publication History
Received: 09 October 2025
Accepted: 05 February 2026
Accepted Manuscript online: 09 February 2026
Article published online: 20 February 2026
© 2026. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany
