DOI: 10.1055/a-2327-4121
Evaluation of a Digital Scribe: Conversation Summarization for Emergency Department Consultation Calls
Funding The project described was supported by Award Number UM1TR004548 from the National Center for Advancing Translational Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center for Advancing Translational Sciences or the National Institutes of Health.
Abstract
Objectives We present a proof-of-concept digital scribe system, a pipeline that summarizes clinical conversations from emergency department (ED) consultation calls to support clinical documentation, and report its performance.
Methods We use four pretrained large language models to build the digital scribe system: T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN, via zero-shot and fine-tuning approaches. Our dataset includes 100 referral conversations among ED clinicians and the corresponding medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L scores to compare model performance. In addition, we annotated transcriptions to assess the quality of generated summaries.
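The evaluation code itself is not reproduced here, but the reported ROUGE metrics follow standard definitions: ROUGE-N measures n-gram overlap between a candidate and a reference summary, and ROUGE-L is based on their longest common subsequence. A minimal, illustrative sketch of the F1 variants (standard-library Python; not the study's actual evaluation code, which would typically use an established ROUGE package):

```python
from collections import Counter

def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F1: n-gram overlap between reference and candidate summaries."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence (LCS) of tokens."""
    a, b = reference.lower().split(), candidate.lower().split()
    # Classic LCS dynamic program over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if not lcs:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 follows from `rouge_n_f1(..., n=2)`; higher F1 indicates closer lexical agreement between a model-generated summary and the clinician-written reference.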
Results The fine-tuned BART-Large-CNN model demonstrates the best summarization performance, with the highest ROUGE scores (ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35). In contrast, PEGASUS-PubMed lags notably (ROUGE-1 F1 = 0.28, ROUGE-2 F1 = 0.11, ROUGE-L F1 = 0.22). BART-Large-CNN's performance decreases by more than 50% with the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate.
Conclusion The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, indicated by its performance with and without fine-tuning. Despite some instances of high recall, the model's performance is variable, particularly in achieving consistent correctness, suggesting room for refinement. Its recall also varies across information categories. The study provides evidence of the potential of artificial intelligence-assisted tools to support clinical documentation. Future work should expand the research scope with additional language models, hybrid approaches, and comparative analyses measuring documentation burden and human factors.
Keywords
text summarization - emergency department - clinical conversation - pretrained language model - documentation burden
Protection of Human and Animal Subjects
No human subjects were involved in the study.
Publication History
Received: 08 January 2024
Accepted: 14 May 2024
Accepted Manuscript online: 15 May 2024
Article published online: 24 July 2024
© 2024. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany