Abstract
Objectives We present a proof-of-concept digital scribe system, a summarization pipeline for emergency department (ED) consultation-call clinical conversations intended to support clinical documentation, and report its performance.
Methods We build the digital scribe system with four pretrained language models (T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN) using both zero-shot and fine-tuning approaches. Our dataset includes 100 referral conversations among ED clinicians and the associated medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L scores to compare model performance. In addition, we annotated the transcriptions to assess the quality of the generated summaries.
Results The fine-tuned BART-Large-CNN model achieves the strongest summarization performance, with the highest ROUGE scores (ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35). In contrast, PEGASUS-PubMed lags notably (ROUGE-1 F1 = 0.28, ROUGE-2 F1 = 0.11, ROUGE-L F1 = 0.22). BART-Large-CNN's performance decreases by more than 50% under the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate.
Conclusion The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, as indicated by its performance both with and without fine-tuning. Despite instances of high recall, the model's performance is variable, particularly in achieving consistent correctness, suggesting room for refinement. Its recall also varies across information categories. The study provides evidence of the potential of artificial intelligence-assisted tools to support clinical documentation. Future work should expand the research scope with additional language models and hybrid approaches, and include comparative analyses to measure documentation burden and human factors.
Keywords
text summarization - emergency department - clinical conversation - pretrained language model - documentation burden