Appl Clin Inform 2024; 15(05): 1049-1055
DOI: 10.1055/a-2405-0138
Special Topic on Teaching and Training Future Health Informaticians

ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Tessa Danehy*¹, Jessica Hecht*¹, Sabrina Kentis¹, Clyde B. Schechter², Sunit P. Jariwala³

1   Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
2   Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States
3   Division of Allergy/Immunology, Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
Funding: None.

Abstract

Objectives The main objective of this study is to evaluate the ability of the large language model ChatGPT (Chat Generative Pre-Trained Transformer) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge questions. Additional objectives are to compare the overall accuracy of GPT-3.5 with that of GPT-4 and to assess the variability of responses given by each version.

Methods Using AMBOSS, a third-party USMLE Step exam preparation service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We posed these questions to GPT-3.5 and GPT-4 over 30 trials each and recorded the output. Accuracy was evaluated with a random-effects linear probability regression model, and response variation was evaluated with a Shannon entropy calculation.
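
The Shannon entropy measure referenced above can be illustrated with a minimal sketch (Python is used here only for illustration; this is not the authors' analysis code, and the answer distributions shown are hypothetical). For each question, entropy is computed over the distribution of answer choices selected across the 30 trials, with 0 indicating a perfectly consistent answer and higher values indicating greater response variability.

```python
import math
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the answer-choice distribution
    for one question across repeated trials."""
    counts = Counter(responses)
    n = len(responses)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example: answer choices returned over 30 trials of one question
answers = ["A"] * 27 + ["C"] * 3
print(round(shannon_entropy(answers), 2))     # 0.47 -> mostly consistent, some variation
print(round(shannon_entropy(["B"] * 30), 2))  # 0.0  -> identical answer on every trial
```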

Results Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower on medical ethics questions than on medical knowledge questions (p < 0.05), and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points on medical ethics (p < 0.001) and by 33 percentage points on medical knowledge (p < 0.001). GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for medical ethics and medical knowledge questions (0.21 and 0.11 vs. 0.59 and 0.55, respectively), indicating lower response variability.

Conclusion Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions for medical education.

Protection of Human and Animal Subjects

No human or animal subjects were included in this project.


* These authors contributed equally to the manuscript.


Publication History

Received: 01 May 2024

Accepted: 27 August 2024

Accepted Manuscript online: 29 August 2024

Article published online: 04 December 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany