Appl Clin Inform 2024; 15(05): 1049-1055
DOI: 10.1055/a-2405-0138
Special Topic on Teaching and Training Future Health Informaticians

ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions

Authors

  • Tessa Danehy*

    1   Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
  • Jessica Hecht*

    1   Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
  • Sabrina Kentis

    1   Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States
  • Clyde B. Schechter

    2   Department of Family and Social Medicine, Albert Einstein College of Medicine, Bronx, New York, United States
  • Sunit P. Jariwala

    3   Division of Allergy/Immunology, Albert Einstein College of Medicine, Montefiore Medical Center, Bronx, New York, United States

Funding None.
 

Abstract

Objectives The main objective of this study is to evaluate the ability of the Large Language Model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. Additional objectives are to compare the overall accuracy of GPT-3.5 to GPT-4 and to assess the variability of responses given by each version.

Methods Using AMBOSS, a third-party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy and a Shannon entropy calculation evaluated response variation.

Results Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 performed 18 percentage points (p < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (p = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), which indicates lower response variability.

Conclusion Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.


Background and Significance

Chat Generative Pre-Trained Transformer (ChatGPT) is a large language model (LLM) developed by OpenAI.[1] GPT-3.5 was released free of charge in November 2022, followed by a paid version, GPT-4, released in March 2023. GPT-4 boasts a larger model size, better contextual understanding of free text, and improved reasoning capabilities.[2] After the introduction of the paid version of ChatGPT, reports emerged that the free version, GPT-3.5, was getting worse.[3] This highlights the ability of language models to change over time, a change that is important for the medical community to note, especially as health care providers become more willing to rely on LLMs for a variety of tasks.[4] Shortly after the introduction of ChatGPT in November 2022, the LLM was assessed on its ability to correctly answer United States Medical Licensing Examination (USMLE) board-style questions. In 2022, it was demonstrated that GPT-3.5 could answer USMLE-style questions with 60% accuracy, correlating to a passing score on the Step 1 board examination.[5] [6] ChatGPT's sensational ability to pass USMLE Step 1 prompted medical students to utilize ChatGPT as a study tool.[7] However, the evolving nature of ChatGPT raises the concern that, despite the highly publicized passing of the USMLE when tested in December 2022, the current version of GPT-3.5 should be reassessed for accuracy. That language models change over time is expected; however, if they are changing to become less accurate, this runs counter to user expectations.

In addition to evolving model accuracy, the intrinsic ability of ChatGPT to answer USMLE-style questions accurately may vary across subject areas despite its overall ability to receive a passing score. Previous research shows that ChatGPT's accuracy varies among medical specialties.[8] [9] [10] [11] However, ChatGPT has not yet been assessed specifically on the different subject areas tested on USMLE Step 1. Of particular interest is the area of medical ethics, as these questions often require students to employ a level of moral judgment when selecting the best answer. Subject areas requiring moral judgment may be more challenging for an Artificial Intelligence (AI) model to answer compared to medical knowledge-based questions.[12]

This study aims to evaluate the performance of GPT-3.5 and GPT-4 and compare that performance to that of medical students in answering USMLE Step 1 style questions sourced from the AMBOSS question bank. AMBOSS is an online USMLE Step 1 test prep resource that contains a large bank of multiple choice questions that mimic those on the USMLE.[13] In addition, AMBOSS provides metrics on question difficulty and the percentage of students who answered each question correctly. This study's objectives are threefold: (1) compare ChatGPT's ability to answer medical ethics versus medical knowledge board-style multiple choice questions, (2) assess and compare GPT-3.5, GPT-4, and AMBOSS users' overall accuracy in answering USMLE-style test prep questions, and (3) quantify response variability between outputs. We hypothesized that GPT-3.5 and GPT-4 would perform more poorly on medical ethics questions than on the medical knowledge questions tested on Step 1. We also hypothesized that GPT-4 would outperform GPT-3.5 on overall accuracy due to its aforementioned improvements, and further hypothesized that ChatGPT's Shannon entropy, which we used to quantify response variability, would be greater than 0.


Methods

Study Sample and Setting

This study was performed at the Albert Einstein College of Medicine and did not require Institutional Review Board approval. The researchers carried out question sampling and data collection using personal AMBOSS and ChatGPT accounts.

The online AMBOSS question bank was scraped for all available medical ethics questions (n = 27) pertaining to USMLE Step 1. Each question was recorded along with the AMBOSS-assigned difficulty level for medical students and the average student accuracy. Questions were prelabeled by AMBOSS into categories such as Exam, System, and Discipline. Twenty-seven USMLE Step 1 medical knowledge questions distributed across the 15 organ systems were selected and matched to the medical ethics questions on difficulty level (difficulty distribution: Easy 77.78%, Medium 11.11%, Hard 11.11%). Each trial consisted of a 54-question “exam,” which was administered to ChatGPT in two parts: a medical ethics section and a medical knowledge section. Two sample questions similar to the ones we used can be found in [Supplementary Appendix A] (available in the online version).


Data Collection

We administered the compiled multiple choice examination to GPT-3.5 and GPT-4 on the OpenAI website along with the prompt “Please answer each multiple choice question with only the letter, in the following format: A B” and recorded the letter answer choices produced by ChatGPT. We ran 30 trials of the 27 medical knowledge questions on versions 3.5 and 4, followed by 30 trials of the medical ethics questions on both versions. A new chat session was opened for each trial, and all data were collected in February 2024. At the time of data collection, the most recent OpenAI knowledge refresh was October 2023, meaning GPT-3.5 and GPT-4 had access to current events and internet data up to that date.[1]
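To make the scoring step concrete, the sketch below shows one way the recorded letter answers could be tabulated for analysis. It is an illustrative Python example, not the authors' code: the file name chatgpt_trials.csv and its column names (question_id, domain, version, trial, model_answer, correct_answer) are hypothetical placeholders for however the trial outputs were recorded.

```python
# Hypothetical tabulation of recorded ChatGPT answers into a long-format dataset.
# Assumes one row per question per trial with the columns listed in the lead-in.
import pandas as pd

df = pd.read_csv("chatgpt_trials.csv")

# Score each response: 1 if ChatGPT's letter matches the answer key, else 0.
df["correct"] = (
    df["model_answer"].str.strip().str.upper()
    == df["correct_answer"].str.strip().str.upper()
).astype(int)

# Mean accuracy by GPT version and subject domain
# (proportion correct over 30 trials x 27 questions per cell).
print(df.groupby(["version", "domain"])["correct"].mean())
```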


Statistical Analysis

A normal probability plot of the paired differences was created to confirm the normality assumptions in our analysis ([Fig. 1]). To compare the accuracy of responses generated by GPT-3.5 to those of GPT-4, and to contrast those comparisons in the ethical and medical domains, we fit a random-effects linear probability regression model with a dichotomous outcome variable (correct or incorrect response), and dichotomous explanatory variables GPT version and test domain (medical ethics vs. medical knowledge) as well as their interaction. The model included a random intercept at the test question level. All statistical analyses were carried out using Stata version 18MP4 and a significance level of p < 0.05.
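For readers who prefer a worked example, the following is a minimal Python analog of the random-effects linear probability model described above, fit with statsmodels rather than Stata (the authors' actual Stata code is in the Supplementary Materials). It assumes the hypothetical long-format DataFrame df from the previous sketch, with a 0/1 correct outcome.

```python
# Illustrative analog of the random-effects linear probability regression:
# dichotomous outcome (correct/incorrect) regressed on GPT version, test domain,
# and their interaction, with a random intercept for each test question.
import statsmodels.formula.api as smf

model = smf.mixedlm(
    "correct ~ C(version) * C(domain)",  # version x domain interaction
    data=df,
    groups=df["question_id"],            # random intercept at the question level
)
result = model.fit()
print(result.summary())
```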

Fig. 1 Normal probability plot of the paired differences to confirm assumptions of normality.

To determine response variability, we calculated the Shannon entropy of ChatGPT's answer choices across the 30 trials for both GPT-3.5 and GPT-4. Shannon entropy is widely used in information theory as a measure of uncertainty in the outcomes of a random variable.[14] An entropy value of 0 signifies that the same answer was selected in each of the 30 trials, whereas an even distribution, with each of the five answer choices chosen six times, would have an entropy of 2.32. We calculated the average entropy of the response distribution across the 27 questions for both versions and topics using Stata version 18MP4. Student response accuracy was collected directly from AMBOSS, which reports the percentage of students that selected each answer choice for a given question. Data such as the total number of students who attempted a specific question, or attempted a question multiple times, are not available. Therefore, Shannon entropy and other summary statistics could not be calculated for student responses. The Stata code used for both the linear regression and the Shannon entropy calculation is included in [Supplementary Materials S1] and [S2] (available in the online version).
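As a companion to the Stata code in the Supplementary Materials, here is a hedged Python sketch of the entropy calculation, H = −Σ p_i log2(p_i), applied per question to the distribution of letter answers across the 30 trials. It again assumes the hypothetical df from the earlier sketch.

```python
# Shannon entropy (base 2) of each question's answer distribution across trials.
# A question answered identically in all 30 trials has entropy 0; five answers
# chosen 6 times each give log2(5) ~ 2.32, matching the values cited in the text.
import numpy as np

def shannon_entropy(answers):
    counts = answers.value_counts()
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

per_question = (
    df.groupby(["version", "domain", "question_id"])["model_answer"]
      .apply(shannon_entropy)
)

# Average entropy across the 27 questions for each version and topic.
print(per_question.groupby(level=["version", "domain"]).mean())
```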



Results

GPT Performance Accuracy on Medical versus Ethics Questions

Our random-effects linear probability regression model demonstrated that both versions of ChatGPT had superior accuracy on medical questions compared to ethics questions. GPT-3.5 had an accuracy of 54% (95% CI: 42, 66) on medical knowledge questions and 47% (95% CI: 35, 59) on medical ethics questions. GPT-4 had an accuracy of 88% (95% CI: 76, 100) on medical knowledge questions and 70% (95% CI: 58, 82) on medical ethics questions ([Table 1]). The mean accuracy difference between the medical knowledge and medical ethics domains is 7 percentage points (95% CI: 0, 24; p = 0.41) for GPT-3.5 and 18 percentage points (95% CI: 1, 35; p = 0.036) for GPT-4. Therefore, although both versions of the language model performed better on medical knowledge questions than on medical ethics questions, only the difference for GPT-4 proved to be statistically significant. Student user data from AMBOSS show an average accuracy of 79% on the medical ethics questions and 73% on the medical knowledge questions ([Table 1]).

Table 1

Average accuracy of GPT-3.5, GPT-4, and student performance on medical and ethics questions

Question set | GPT-3.5 questions answered correctly (%) with 95% CI | GPT-4 questions answered correctly (%) with 95% CI | Student questions answered correctly (%) | GPT-3.5 versus GPT-4 mean difference (percentage points) with 95% CI | GPT-3.5 versus GPT-4 p-value
Medical ethics questions (n = 27) | 47 (35–59) | 70 (58–82) | 79[a] | 22 (19–26) | <0.001
Medical knowledge questions (n = 27) | 54 (42–66) | 88 (76–100) | 73[a] | 33 (30–37) | <0.001

Abbreviation: GPT, Generative Pre-Trained Transformer.

a Summary statistics were not available for student accuracy as numbers were obtained directly from AMBOSS.



Performance Accuracy of GPT-3.5 versus GPT-4

When comparing the accuracy of the two versions, GPT-4 significantly outperformed GPT-3.5 across subject domains ([Table 1]). The mean accuracy difference between GPT-3.5 and GPT-4 is 22 percentage points (95% CI: 19, 26; p < 0.001) in the medical ethics domain and 33 percentage points (95% CI: 30, 37; p < 0.001) in the medical knowledge domain. Furthermore, GPT-4 exceeded the USMLE passing score of 60% in both subject domains, while GPT-3.5 did not pass in either subject domain ([Fig. 2]).

Fig. 2 GPT-3.5 and GPT-4 mean accuracy for medical ethics and medical knowledge questions plotted with 95% CI. GPT, Generative Pre-Trained Transformer; USMLE, United States Medical Licensing Examination.

Response Variability

We found the Shannon entropy for GPT-4 (0.21 for ethics questions and 0.11 for medical questions) to be substantially lower than that for GPT-3.5 (0.59 and 0.55, respectively) across both subject areas, indicating lower response variability ([Table 2]). The difference in mean entropy between GPT-4 and GPT-3.5 among medical ethics questions is 0.38 (95% CI: 0.12–0.65; p < 0.005), while the difference among medical knowledge questions is 0.45 (95% CI: 0.18–0.71; p < 0.001). The difference in entropy across the two subjects, medical ethics and medical knowledge, was not statistically significant (p = 0.738).

Table 2

Summary statistics for entropy of responses for GPT-3.5 and GPT-4 on ethics and medical questions

Question set | GPT-3.5 mean entropy | GPT-4 mean entropy | Difference between versions 3.5 and 4 (95% CI) | p-Value
Medical ethics questions (n = 27) | 0.59 | 0.21 | 0.38 (0.12–0.65) | <0.005
Medical knowledge questions (n = 27) | 0.55 | 0.11 | 0.45 (0.18–0.71) | <0.001

Abbreviation: GPT, Generative Pre-Trained Transformer.




Discussion

GPT Performance Accuracy on Medical versus Ethics Questions

GPT-4 was significantly more accurate in answering medical knowledge questions than medical ethics questions. This finding elucidates an important caveat for the use of ChatGPT as a study tool for the USMLE Step 1 examination. Although research assessing ChatGPT's moral reasoning is limited, previous work by Krügel et al found that when responding to ethical dilemmas like the Trolley Problem, ChatGPT gives inconsistent responses across trials, and the responses provided are surface-level.[15] Our work similarly finds that ChatGPT is less reliable in the realm of medical ethics than in medical knowledge. However, that prior work attributed ChatGPT's limitations in the realm of morality to inconsistency in its responses to ethical dilemmas, implying that the LLM does not have a singular “moral compass” by which it answers ethics questions. In contrast, our work found no statistical difference in response variation (Shannon entropy) between medical ethics and medical knowledge questions. Rather, GPT-4 shows comparable consistency in its responses to ethical questions but exhibits lower accuracy. Thus, despite GPT-4's lower response variability, its lower accuracy in the ethical domain, both relative to the medical domain and relative to medical students, highlights an area of relative weakness for ChatGPT. The student accuracy reported by AMBOSS shows that the average medical student is more accurate on ethics questions than on medical questions, further contrasting AI and human performance. One potential explanation for why these ethics questions are seen as “easy” for medical students is that answering them correctly relies on critical thinking and moral judgment skills that people accumulate over a lifetime, and it is still uncertain to what extent ChatGPT develops moral reasoning.[16] An important difference to note is that Krügel et al collected their data in December 2022, shortly after ChatGPT was released to the public by OpenAI, on one of the earliest versions of GPT-3.5. As is the nature of LLMs, ChatGPT's underlying model changes with each version update, limiting comparison between results from December 2022 and February 2024.


Performance Accuracy of GPT-3.5 versus GPT-4

Our data demonstrate an improved capability of GPT-4 in answering USMLE-style Step 1 questions compared to GPT-3.5. This finding is in agreement with other recent work demonstrating the superior performance of GPT-4 relative to the previously reported accuracy of GPT-3.5.[17] Our findings are also consistent with previous reporting that the accuracy of the free GPT-3.5 platform's responses declined after the introduction of the paid GPT-4.[3] In contrast to research by Gilson et al from 2023, which utilized the older version of GPT-3.5,[5] our data as of February 2024 show that GPT-3.5 no longer answers USMLE-style questions at a rate correlating to a passing score on Step 1. The threshold for passing a USMLE examination is around 60%, and GPT-3.5 scored on average 54% on the medical questions and 47% on the ethics questions, indicating a failing score ([Table 1]). These results are further emphasized by the fact that this selection of questions was, on average, easier than a typical USMLE examination: the majority of the ethics questions were of level 1 difficulty (n = 21) according to the AMBOSS platform, and we matched the medical questions to the same difficulty distribution. Students and educators should be aware of this decline in performance when using the currently available free version, GPT-3.5, as a study tool and may want to consider switching to the paid GPT-4 for increased accuracy.

Despite the shortcomings of GPT-3.5, our data show a significant improvement of GPT-4 over GPT-3.5 on the same question set administered on the same day. GPT-4 scored on average 88% correct on medical knowledge questions and 70% on medical ethics questions. These numbers surpass the previously reported ability of GPT-3.5, and the LLM outperforms student users on AMBOSS on medical knowledge questions. These improvements demonstrate ChatGPT's promising ability to correctly answer medical knowledge and medical ethics questions.


Response Variability

Additional improvement in the newest GPT-4 version can be seen in its lower Shannon entropy. Shannon entropy is a statistical measure quantifying the variability of the responses given by ChatGPT.[14] This is an important measurement for evaluating a model because, although ChatGPT generates a new response for each prompt, an ideal model asked the same multiple choice question multiple times would give the same correct answer each time.[18] Our results show that GPT-4 has lower entropy and, therefore, more consistent outputs than version 3.5. Additionally, our analysis shows that the subject area does not correlate with the consistency of outputs. In contrast to the findings of Krügel et al, ChatGPT showed no significant difference in entropy between medical knowledge responses and medical ethics responses. Thus, while one may have presumed that ChatGPT performs worse on medical ethics questions because it is unsure of the correct answer and chooses randomly, we found that GPT-4 consistently chooses the same answer (whether correct or incorrect) at a rate similar to that for medical knowledge questions. In simpler terms, when selecting the incorrect answer to a medical ethics question, ChatGPT more often than not chooses the same incorrect answer, perhaps implying an understanding of ethics different from that of medical students.


Limitations

This study used a small sample size of questions (n = 54) to evaluate ChatGPT's performance due to the limited number of medical ethics questions contained in the AMBOSS question bank. The majority of the AMBOSS medical ethics questions were rated a level 1, indicating they were easy for the average medical student according to the AMBOSS platform, and there were only six questions of levels 2 and 3 difficulty for each subject domain. We matched the medical ethics and medical knowledge questions in terms of difficulty; however, this caused our entire question set to be easier for a medical student than a typical USMLE examination. Additionally, because confidence intervals are not provided for student accuracy on AMBOSS questions, we are unable to definitively determine if there was a significant difference between ChatGPT and student performance.

It is reasonable to assume that both versions of ChatGPT were not naive to the AMBOSS question sets. OpenAI incorporates user input into ChatGPT's training; therefore, even though AMBOSS content is behind a paywall, if a student enters an AMBOSS question into ChatGPT asking for an explanation, the model can be trained on that data. This study was inspired by previous research that input AMBOSS questions into ChatGPT,[5] as well as by anecdotal evidence of medical students utilizing ChatGPT exactly for this purpose. Thus, it is reasonable to assume that ChatGPT has been trained on many, if not all, of the questions included in our dataset.

Future research could include increasing the question sample size and evaluating accuracy by difficulty level. One direction could be to evaluate the 120 questions in the National Board of Medical Examiners (NBME) free online practice examination,[19] known as the “free 120,” which are more proportional to the difficulty of the USMLE Step 1 examination and have been used in previous studies.[5] We did not use the NBME free 120 for this experiment because it does not contain a sufficient number of medical ethics questions; however, it could be a helpful addition to increase sample size and question difficulty. Future research could also investigate framing questions using different prompts, for example, “If you are a medical student taking Step 1, how would you answer the following question?” Clarifying that questions should be answered in the context of the Step 1 examination could influence ChatGPT to answer differently, particularly with regard to bioethics questions, which may have less of a clear correct answer. The lack of prompt engineering and prompt optimization tailored to our research question was a limitation of this study. A single prompt, “Please answer each multiple choice question with only the letter, in the following format: A B,” followed by our multiple choice questions was utilized, without testing different prompts to assess their potential impact on the results. This is an important consideration given recent research showing that variations in LLM prompts can significantly influence outcomes and accuracy.[20]



Conclusion

ChatGPT-3.5 and GPT-4 performed better on medical questions than on ethics questions, despite AMBOSS data showing the opposite pattern among medical students. Further research is required to explore why this is the case. When comparing the two versions of the algorithm, GPT-4 was significantly higher performing than GPT-3.5 in terms of both accuracy and response variability. GPT-4 performed very well on both ethics and medical questions, at a rate correlating to a passing score in both subject domains. In contrast to previous findings, GPT-3.5 was not able to answer USMLE-style questions at a rate correlating to a passing score on the Step 1 examination. Ultimately, GPT-4 did remarkably well on both medical knowledge and ethics questions compared to GPT-3.5, demonstrating the tangible improvements and forward progress in AI's capabilities within a short time span.

The difference between GPT-3.5 and GPT-4 on USMLE-style questions is vital for students and educators to be aware of. It may be the case that most students and educators using ChatGPT as a study tool are using the free version, GPT-3.5, which is less accurate than GPT-4. Additionally, it is important for students, medical professionals, and educators to be aware that ChatGPT's accuracy varies across subject areas and over time. LLMs are not static; therefore, repeated evaluation of these models is necessary if we are to incorporate their usage into medical education and health care as a whole.


Clinical Relevance Statement

This study provides further evidence that the accuracy of ChatGPT varies significantly across USMLE subject areas and versions. Research aimed at understanding the performance and limitations of this AI software is essential as the use of these platforms increases. Further research should be conducted to understand what specifically caused ChatGPT to underperform on medical ethics questions compared to medical knowledge-based questions.


Multiple-Choice Questions

  1. Which of the following is a challenge in the assessment of LLMs such as ChatGPT?

    • The lack of standardized assessment measures

    • The ability of the model to change over time

    • The accessibility of these models online

    • The affordability of using LLM

Correct answer: Option b. LLMs can change over time, which makes any assessment temporal in nature. As mentioned in our paper, ChatGPT version 3.5 passed the USMLE in December 2022; however, when measured in February 2024, it did not meet the passing threshold. This demonstrates the importance of reassessing LLMs; just because a model once performed at a certain level does not mean it will always remain at that level. Options a., c., and d. are incorrect. Option a. is incorrect because standardized tests like the USMLE have been widely used to assess LLM capabilities. Options c. and d. are incorrect because version 3.5 of ChatGPT can be accessed online with a free account and is widely accessible to anyone with a device and internet connection.

  2. Based on the results of our study, in which of the following combinations of version and subject area would you anticipate ChatGPT to perform most accurately?

    • ChatGPT-3.5, medical knowledge subject area

    • ChatGPT-3.5, medical ethics subject area

    • ChatGPT-4, medical knowledge subject area

    • ChatGPT-4, medical ethics subject area

Correct answer: Option c. Our results demonstrate that ChatGPT version 4 significantly outperforms version 3.5 and that ChatGPT performed more accurately on medical knowledge-based USMLE questions than on medical ethics questions. Therefore, option c. would have the highest accuracy.



Conflict of Interest

None declared.

Acknowledgments

We sincerely thank the Albert Einstein College of Medicine for their educational resources and support, and the Albert Einstein College of Medicine Medical AI Interest Group for facilitating our collaboration.

Protection of Human and Animal Subjects

No human or animal subjects were included in this project.


* These authors contributed equally to the manuscript.



Address for correspondence

Tessa Danehy, BA
Albert Einstein College of Medicine, Montefiore Medical Center
Bronx, NY 10461
United States   

Publication History

Received: 01 May 2024

Accepted: 27 August 2024

Accepted Manuscript online:
29 August 2024

Article published online:
04 December 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

