DOI: 10.1055/a-2491-3872
Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models
Funding This study was funded by the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK, grant no. K08DK131286).
Abstract
Background Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes from patient notes, remains underexplored.
Objective The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of a human coder.
Methods We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association (AHIMA) VLab for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
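The abstract does not specify exactly how agreement was operationalized when an LLM and the human coder assigned different code sets to the same note. The sketch below shows one plausible way to compute percent agreement and Cohen's kappa in this setting: every code that either rater assigned to a note is treated as a binary present/absent label for both raters, and the two statistics are computed over the pooled labels. The scikit-learn call, the per-note pooling scheme, and the example ICD-10-CM codes are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumed workflow, not the study's pipeline): agreement between
# an LLM's extracted ICD-10-CM codes and a human coder's codes per inpatient note.
from sklearn.metrics import cohen_kappa_score

def agreement_stats(llm_codes_per_note, coder_codes_per_note):
    """Return (percent agreement, Cohen's kappa) for paired lists of code sets."""
    llm_labels, coder_labels = [], []
    for llm_set, coder_set in zip(llm_codes_per_note, coder_codes_per_note):
        # Candidate codes for this note: every code either rater extracted.
        for code in sorted(llm_set | coder_set):
            llm_labels.append(code in llm_set)
            coder_labels.append(code in coder_set)
    matches = sum(a == b for a, b in zip(llm_labels, coder_labels))
    percent_agreement = 100 * matches / len(llm_labels)
    kappa = cohen_kappa_score(llm_labels, coder_labels)
    return percent_agreement, kappa

# Toy example with hypothetical ICD-10-CM codes for two notes.
llm = [{"E11.9", "I10", "R07.9"}, {"N17.9"}]
coder = [{"E11.65", "I10"}, {"N17.9", "D64.9"}]
print(agreement_stats(llm, coder))
```

Because this pooling only considers codes that at least one rater assigned, over-extraction by an LLM directly depresses both statistics, which is consistent with the low agreement reported below.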
Results Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from −0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among the LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).
Conclusion Current LLMs perform poorly at extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
Protection of Human and Animal Subjects
Human and animal subjects were not included in the project.
# Equal contribution as first author.
* Equal contribution as senior author.
Publication History
Received: June 6, 2024
Accepted: November 27, 2024
Accepted Manuscript online: November 28, 2024
Article published online: April 16, 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany