Appl Clin Inform 2026; 17(01): 064-072
DOI: 10.1055/a-2807-4256
Research Article

Measuring the Accuracy and Reproducibility of DeepSeek R1, Claude 3.5 Sonnet, and GPT-4.1 on Complex Clinical Scenarios

Authors

  • Robert E. Hoyt

    1   Internal Medicine, Virginia Commonwealth University, Richmond, Virginia, United States
  • Maria Bajwa

    2   Department of Health Professions Education (HPEd), School of Health & Rehabilitation Sciences, MGH Institute of Health Professions, Boston, Massachusetts, United States

Abstract

Background

The integration of large language models (LLMs) into clinical diagnostics presents significant challenges regarding their accuracy and reliability.

Objectives

This study aimed to evaluate the performance of DeepSeek R1, an open-source reasoning model, alongside two other LLMs, GPT-4.1 and Claude 3.5 Sonnet, across multiple-choice clinical cases.

Methods

A dataset of complex medical cases representative of real-world clinical practice was selected.

For efficiency, models were accessed via application programming interfaces (APIs) and assessed using standardized prompts and a predefined evaluation protocol.
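For illustration, the evaluation loop can be sketched as follows. This is a minimal Python outline assuming an OpenAI-compatible chat-completions endpoint; the model name, prompt wording, and function names are placeholders for exposition, not the study's exact notebook code (the full notebook is available in the MMLU-Pro-Project repository).

```python
# Illustrative sketch only; assumes an OpenAI-compatible API and a placeholder prompt.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment (OPENAI_API_KEY)

PROMPT_TEMPLATE = (
    "You are answering a multiple-choice clinical case.\n"
    "Case: {case}\n"
    "Options:\n{options}\n"
    "Respond with the single letter of the best answer."
)

def ask_model(model: str, case: str, options: str) -> str:
    """Send one standardized prompt and return the model's answer letter."""
    response = client.chat.completions.create(
        model=model,  # e.g., "gpt-4.1" (placeholder)
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(case=case, options=options)}],
        temperature=0,  # minimize run-to-run variation
    )
    return response.choices[0].message.content.strip()[:1].upper()
```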

Results

The models demonstrated an overall accuracy of 77.1%, with GPT-4.1 producing the fewest errors and Claude 3.5 Sonnet the most. The reproducibility analysis indicated highly consistent answers across repeated runs: DeepSeek R1 (100%), GPT-4.1 (97.5%), and Claude 3.5 Sonnet (92%).
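As a minimal illustration of how such figures can be computed, the sketch below scores accuracy as agreement with the keyed answer and reproducibility as the fraction of cases in which every repeated run returned the same letter; the function names and data layout are assumptions for clarity, not the study's scoring code.

```python
# Illustrative calculation only; names and run structure are assumptions.

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of cases where the model's letter matches the keyed answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def reproducibility(runs: list[list[str]]) -> float:
    """Fraction of cases where all repeated runs returned the same letter."""
    n_cases = len(runs[0])
    identical = sum(len({run[i] for run in runs}) == 1 for i in range(n_cases))
    return identical / n_cases

# Example: three runs over two cases; case 1 agrees, case 2 does not -> 0.5
# reproducibility([["A", "B"], ["A", "B"], ["A", "C"]])
```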

Conclusion

While LLMs show promise for enhancing diagnostics, ongoing scrutiny is required to address error rates and to validate the reference answers against which models are scored. Given the limited dataset and prompting protocol, these findings should not be interpreted as evidence of broader equivalence in real-world clinical reasoning. This study demonstrates the need for robust evaluation standards, attention to error rates, and further research.

Protection of Human and Animal Subjects

No human subjects were involved in our research. For that reason, our study was not submitted to an Institutional Review Board (IRB).


Declaration of GenAI Use

During the writing process of this paper, the author(s) used QuillBot to improve spelling and syntax. The author(s) reviewed and edited the text and take(s) full responsibility for the content of the paper.


Data Availability Statement

The Python notebook, dataset, and readme files are available on GitHub in the MMLU-Pro-Project repository ( https://github.com/rehoyt/MMLU-Pro-Project.git ).




Publication History

Received: 09 October 2025

Accepted: 05 February 2026

Accepted Manuscript online:
09 February 2026

Article published online:
20 February 2026

© 2026. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany