Appl Clin Inform 2024; 15(05): 1049-1055
DOI: 10.1055/a-2405-0138
Special Topic on Teaching and Training Future Health Informaticians

ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Tessa Danehy*¹, Jessica Hecht*¹, Sabrina Kentis¹, Clyde B. Schechter², Sunit P. Jariwala³

1   Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
2   Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States
3   Division of Allergy/Immunology, Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
Funding: None.

Abstract

Objectives The main objective of this study is to evaluate the ability of the large language model ChatGPT (Chat Generative Pre-Trained Transformer) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared with medical knowledge questions. Additional objectives are to compare the overall accuracy of GPT-3.5 with that of GPT-4 and to assess the variability of responses given by each version.

Methods Using AMBOSS, a third-party USMLE Step exam preparation service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We posed these questions to GPT-3.5 and GPT-4 over 30 trials each and recorded the output. Accuracy was evaluated with a random-effects linear probability regression model, and response variation was evaluated with a Shannon entropy calculation.
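
The Shannon entropy measure referenced above can be illustrated with a minimal sketch (Python is used here only for illustration; this is not the authors' analysis code, and the answer distributions shown are hypothetical). For each question, entropy is computed over the distribution of answer choices selected across the 30 trials, with 0 indicating a perfectly consistent answer and higher values indicating greater response variability.

```python
import math
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the answer-choice distribution
    for one question across repeated trials."""
    counts = Counter(responses)
    n = len(responses)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical example: answer choices returned over 30 trials of one question
answers = ["A"] * 27 + ["C"] * 3
print(round(shannon_entropy(answers), 2))     # 0.47 -> mostly consistent, some variation
print(round(shannon_entropy(["B"] * 30), 2))  # 0.0  -> identical answer on every trial
```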

Results Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower on medical ethics questions than on medical knowledge questions (p < 0.05), and GPT-3.5 scored 7 percentage points lower (p = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points on medical ethics (p < 0.001) and by 33 percentage points on medical knowledge (p < 0.001). GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for medical ethics and medical knowledge questions (0.21 and 0.11 vs. 0.59 and 0.55, respectively), indicating lower response variability.

Conclusion Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions for medical education.

Protection of Human and Animal Subjects

No human or animal subjects were included in this project.


* These authors contributed equally to the manuscript.


Publication History

Received: 01 May 2024

Accepted: 27 August 2024

Accepted Manuscript online: 29 August 2024

Article published online: 04 December 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany