DOI: 10.1055/a-2491-3872
Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models
Funding This study was funded by the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK, grant no. K08DK131286).
Abstract
Background Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes from patient notes, remains underexplored.
Objective The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of a human coder.
Methods We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association (AHIMA) VLab for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
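The abstract does not specify exactly how agreement was operationalized when an LLM and the human coder assigned different code sets to the same note. The sketch below shows one plausible way to compute percent agreement and Cohen's kappa in this setting: every code that either rater assigned to a note is treated as a binary present/absent label for both raters, and the two statistics are computed over the pooled labels. The scikit-learn call, the per-note pooling scheme, and the example ICD-10-CM codes are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumed workflow, not the study's pipeline): agreement between
# an LLM's extracted ICD-10-CM codes and a human coder's codes per inpatient note.
from sklearn.metrics import cohen_kappa_score

def agreement_stats(llm_codes_per_note, coder_codes_per_note):
    """Return (percent agreement, Cohen's kappa) for paired lists of code sets."""
    llm_labels, coder_labels = [], []
    for llm_set, coder_set in zip(llm_codes_per_note, coder_codes_per_note):
        # Candidate codes for this note: every code either rater extracted.
        for code in sorted(llm_set | coder_set):
            llm_labels.append(code in llm_set)
            coder_labels.append(code in coder_set)
    matches = sum(a == b for a, b in zip(llm_labels, coder_labels))
    percent_agreement = 100 * matches / len(llm_labels)
    kappa = cohen_kappa_score(llm_labels, coder_labels)
    return percent_agreement, kappa

# Toy example with hypothetical ICD-10-CM codes for two notes.
llm = [{"E11.9", "I10", "R07.9"}, {"N17.9"}]
coder = [{"E11.65", "I10"}, {"N17.9", "D64.9"}]
print(agreement_stats(llm, coder))
```

Because this pooling only considers codes that at least one rater assigned, over-extraction by an LLM directly depresses both statistics, which is consistent with the low agreement reported below.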
Results Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from −0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among the LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).
Conclusion Current LLMs perform poorly at extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
Protection of Human and Animal Subjects
Human and animal subjects were not included in the project.
# Equal contribution as first author.
* Equal contribution as senior author.
Publication History
Received: June 6, 2024
Accepted: November 27, 2024
Accepted Manuscript online: November 28, 2024
Article published online: April 16, 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany