Examining the Generalizability of Pretrained De-identification Transformer Models on Narrative Nursing Notes

Fangyi Chen; Syed Mohtashim Abbas Bokhari; Kenrick Cato; Gamze Gürsoy; Sarah Rossetti

doi:10.1055/a-2282-4340


CC BY-NC-ND 4.0 · Appl Clin Inform 2024; 15(02): 357-367

Research Article

Authors

Fangyi Chen¹, Syed Mohtashim Abbas Bokhari¹, Kenrick Cato²,³, Gamze Gürsoy¹, Sarah Rossetti¹,³

¹Department of Biomedical Informatics, Columbia University, New York, New York, United States
²School of Nursing, University of Pennsylvania, Philadelphia, Pennsylvania, United States
³School of Nursing, Columbia University, New York, New York, United States

Funding This study was supported and funded by the National Institute of Nursing Research (1R01NR016941) and the American Nurses Foundation (ANF) Reimagining Nursing Initiative. The authors are solely responsible for the content of this work, and it does not necessarily reflect the official view of the National Institutes of Health.

Abstract

Background Narrative nursing notes are a valuable resource in informatics research with unique predictive signals about patient care. The open sharing of these data, however, is appropriately constrained by rigorous regulations set by the Health Insurance Portability and Accountability Act (HIPAA) for the protection of privacy. Several de-identification models have been developed and evaluated on the open-source i2b2 dataset, but their generalizability to nursing notes remains understudied.

Objectives The study aims to understand the generalizability of pretrained transformer models and to investigate the variability of protected health information (PHI) distribution patterns between discharge summaries and nursing notes, with the goal of informing the future design of model evaluation schema.

Methods Two pretrained transformer models (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries were evaluated on our inpatient nursing notes and compared with the baseline performance. Statistical testing was used to assess differences in PHI distribution across discharge summaries and nursing notes.

Results RoBERTa achieved the optimal performance when tested on an external source of data, with an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. Overall, discharge summaries contained a higher number of PHI instances and categories of PHI compared with inpatient nursing notes.
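The difference between the two F1 scores reported above comes from how a prediction is counted as correct. The following is a minimal sketch, not the study's evaluation code: it assumes token-level labels, micro-averaging, and an `O` tag for non-PHI tokens, none of which are specified in this abstract. The label names and the functions `category_f1` and `binary_f1` are hypothetical.

```python
from typing import List

NON_PHI = "O"  # assumed tag for tokens that are not PHI


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def category_f1(gold: List[str], pred: List[str]) -> float:
    """Micro-averaged F1 where a PHI token is correct only if its category matches."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g != NON_PHI and g == p)
    fp = sum(1 for g, p in pairs if p != NON_PHI and g != p)
    fn = sum(1 for g, p in pairs if g != NON_PHI and g != p)
    return f1_from_counts(tp, fp, fn)


def binary_f1(gold: List[str], pred: List[str]) -> float:
    """F1 after collapsing every category to PHI versus non-PHI."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g != NON_PHI and p != NON_PHI)
    fp = sum(1 for g, p in pairs if g == NON_PHI and p != NON_PHI)
    fn = sum(1 for g, p in pairs if g != NON_PHI and p == NON_PHI)
    return f1_from_counts(tp, fp, fn)


# Toy labels for illustration only; these are not data from the study.
gold = ["NAME", "NAME", "O", "DATE", "O", "LOCATION"]
pred = ["NAME", "O", "O", "DATE", "DATE", "NAME"]

print(category_f1(gold, pred))  # 0.5
print(binary_f1(gold, pred))    # 0.75
```

In the toy example the last token is flagged as PHI but with the wrong category, so it counts as an error in the category score and as a hit in the binary score. This is one reason a binary PHI score can exceed the per-category score for the same model.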

Conclusion The study investigated the applicability of two pretrained transformers to inpatient nursing notes and examined how nursing notes and discharge summaries differ in their use of PHI. Discharge summaries presented a greater quantity of PHI instances and types when compared with narrative nursing notes, but narrative nursing notes exhibited more diversity in the types of PHI present, with some pertaining to patients' personal lives. The insights obtained from the research help improve the design and selection of algorithms and contribute to the development of suitable performance thresholds for PHI.

Keywords

nursing notes - i2b2 - discharge summaries - de-identification - transformers - NLP

Protection of Human and Animal Subjects

The study was approved by institutional review boards.

Publication History

Received: 01 December 2023

Accepted: 15 February 2024

Accepted Manuscript online: 06 March 2024

Article published online: 08 May 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

