Appl Clin Inform 2025; 16(02): 337-344
DOI: 10.1055/a-2491-3872
Research Article

Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models

Ashley Simmons#
1   Department of Human Performance – Health Informatics and Information Management, West Virginia University, Morgantown, West Virginia, United States
,
Kullaya Takkavatakarn#
2   Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
3   Division of Nephrology, Department of Medicine, King Chulalongkorn Memorial Hospital, Chulalongkorn University, Bangkok, Thailand
,
Megan McDougal#
1   Department of Human Performance – Health Informatics and Information Management, West Virginia University, Morgantown, West Virginia, United States
,
Brian Dilcher
4   Department of Emergency Medicine, West Virginia University, Morgantown, West Virginia, United States
,
Jami Pincavitch
5   Department of Orthopedics, West Virginia University, Morgantown, West Virginia, United States
,
Lukas Meadows
6   Department of Radiology and Imaging Sciences, Emory University, Atlanta, Georgia, United States
,
Justin Kauffman
7   Division of Data Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Eyal Klang
7   Division of Data Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Rebecca Wig
9   Department of Medicine, The University of Arizona, Tucson, Arizona, United States
,
Gordon Smith
10   Department of Epidemiology and Biostatistics, West Virginia University, Morgantown, West Virginia, United States
,
Ali Soroush
7   Division of Data Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
11   Division of Gastroenterology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Robert Freeman
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Donald J. Apakama
12   Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Alexander W. Charney
13   Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, United States
14   Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States
15   Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, United States
16   Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Roopa Kohli-Seth
17   Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Girish N. Nadkarni*
2   Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
7   Division of Data Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
,
Ankit Sakhuja*
7   Division of Data Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
8   The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
17   Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, United States
Funding This study was funded by the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases (NIH/NIDDK) grant (grant no.: K08DK131286).

Abstract

Background Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

Objective The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of a human coder.

Methods We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association (AHIMA) VLab for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
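The agreement metrics used in the Methods can be sketched as follows. This is an illustrative sketch only: the code sets, the helper name `agreement_metrics`, and the choice to score binary presence/absence over the union of extracted codes are hypothetical assumptions, not details taken from the study.

```python
# Illustrative sketch: percent agreement and Cohen's kappa between a human
# coder's ICD-10-CM codes and an LLM's extracted codes for a single note.
# The code sets below are hypothetical examples, not study data.

def agreement_metrics(human: set, llm: set):
    """Score binary presence/absence agreement over the union of codes."""
    universe = sorted(human | llm)
    h = [c in human for c in universe]
    m = [c in llm for c in universe]
    n = len(universe)
    # Observed agreement: fraction of codes both sources treat the same way.
    p_o = sum(a == b for a, b in zip(h, m)) / n
    # Expected chance agreement, from each source's marginal rates.
    p_h, p_m = sum(h) / n, sum(m) / n
    p_e = p_h * p_m + (1 - p_h) * (1 - p_m)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return p_o, kappa

human_codes = {"I10", "E11.9", "N17.9"}            # hypothetical human coder
llm_codes = {"I10", "E11.9", "R50.9", "J18.9"}     # hypothetical LLM output
pct, kappa = agreement_metrics(human_codes, llm_codes)
```

Scoring over the union of codes means a code missed by one source always counts as a disagreement, which penalizes both over-extraction (as observed with Llama 2-70b) and under-extraction.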

Results Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from −0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among the LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

Conclusion Current LLMs perform poorly at extracting ICD-10-CM codes from inpatient notes when compared against a human coder.

Protection of Human and Animal Subjects

Human and animal subjects were not included in the project.


# Equal contribution as first author.


* Equal contribution as senior author.


Supplementary Material



Publication History

Received: 06 June 2024

Accepted: 27 November 2024

Accepted Manuscript online:
28 November 2024

Article published online:
16 April 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany