CC BY-NC-ND 4.0 · Appl Clin Inform 2019; 10(04): 679-692
DOI: 10.1055/s-0039-1695793
Research Article
Georg Thieme Verlag KG Stuttgart · New York

Pan-European Data Harmonization for Biobanks in ADOPT BBMRI-ERIC

Sebastian Mate (1), Marvin Kampf (1), Wolfgang Rödle (2), Stefan Kraus (2), Rumyana Proynova (3), Kaisa Silander (4), Lars Ebert (5), Martin Lablans (5), Christina Schüttler (2), Christian Knell (1), Niina Eklund (4), Michael Hummel (6, 7), Petr Holub (7), Hans-Ulrich Prokosch (1, 2)

Author Affiliations
1 Medical Centre for Information and Communication Technology, Universitätsklinikum Erlangen, Erlangen, Germany
2 Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
3 Medical Informatics in Translational Oncology, German Cancer Research Center, Heidelberg, Germany
4 Genomics and Biobank Unit, Finnish National Institute for Health and Welfare, Helsinki, Finland
5 Federated Information Systems, German Cancer Research Center, Heidelberg, Germany
6 Institute of Pathology, Charité-Universitätsmedizin Berlin, Berlin, Germany
7 Biobanking and BioMolecular Resources Research Infrastructure (BBMRI-ERIC), Graz, Austria

Funding The present work has been co-funded by ADOPT BBMRI-ERIC supported by EU Horizon 2020, grant agreement no. 676550. It was performed in (partial) fulfillment of the requirements for obtaining the degree “Dr. rer. biol. hum.” from the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) (SM).

Publication History

Received: 13 February 2019
Accepted: 12 July 2019
Publication Date: 11 September 2019 (online)

Abstract

Background High-quality clinical data and biological specimens are key for medical research and personalized medicine. The Biobanking and Biomolecular Resources Research Infrastructure-European Research Infrastructure Consortium (BBMRI-ERIC) aims to facilitate access to such biological resources. The accompanying ADOPT BBMRI-ERIC project kick-started BBMRI-ERIC by collecting colorectal cancer data from European biobanks.

Objectives To transform these data into a common representation, a uniform approach for data integration and harmonization had to be developed. This article describes the design and the implementation of a toolset for this task.

Methods Based on the semantics of a metadata repository, we developed a lexical bag-of-words matcher, capable of semiautomatically mapping local biobank terms to the central ADOPT BBMRI-ERIC terminology. Its algorithm supports fuzzy matching, utilization of synonyms, and sentiment tagging. To process the anonymized instance data based on these mappings, we also developed a data transformation application.
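
To make the matching idea concrete, the following is a minimal sketch of a bag-of-words matcher with synonym substitution and Levenshtein-based fuzzy word comparison. It is an illustration under stated assumptions, not the toolset's actual implementation: the synonym table, the scoring formula, the distance limit, the threshold, and all function names are hypothetical, and sentiment tagging is left out.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical synonym table (assumption); maps a word to its preferred form.
SYNONYMS: Dict[str, str] = {"sex": "gender", "neoplasm": "tumor"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def bag_of_words(term: str) -> Set[str]:
    """Lowercase, split into alphanumeric words, and apply synonyms."""
    words = re.findall(r"[a-z0-9]+", term.lower())
    return {SYNONYMS.get(w, w) for w in words}


def words_match(a: str, b: str, max_dist: int = 1) -> bool:
    """Exact match for short words, fuzzy match for words of 4+ characters."""
    if min(len(a), len(b)) < 4:
        return a == b
    return levenshtein(a, b) <= max_dist


def similarity(local: str, central: str) -> float:
    """Share of words in both bags that have a counterpart in the other bag."""
    a, b = bag_of_words(local), bag_of_words(central)
    if not a or not b:
        return 0.0
    hits_a = sum(1 for w in a if any(words_match(w, v) for v in b))
    hits_b = sum(1 for v in b if any(words_match(w, v) for w in a))
    return (hits_a + hits_b) / (len(a) + len(b))


def best_match(local: str, central_terms: List[str],
               threshold: float = 0.6) -> Optional[str]:
    """Return the best-scoring central term, or None to defer to an expert."""
    if not central_terms:
        return None
    score, term = max((similarity(local, c), c) for c in central_terms)
    return term if score >= threshold else None
```

With a central list such as `["Gender", "Date of birth", "Tumor localization"]`, the local term "Sex" resolves through the synonym table and "Tumour localisation" through fuzzy word matching. A term with no counterpart returns `None`, which corresponds to the semiautomatic workflow in which a human expert completes the mapping.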

Results The implementation was used to process the data from 10 European biobanks. The lexical matcher automatically and correctly mapped 78.48% of the 1,492 local biobank terms, and human experts were able to complete the remaining mappings. We used the expert-curated mappings to successfully process 147,608 data records from 3,415 patients.
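
How curated mappings might then be applied to instance data can be sketched as follows. This assumes, purely for illustration, that each record is a (patient, local term, value) triple and that the mapping is a simple dictionary from local to central terms; the record layout, the example term names, and the function name are hypothetical and are not taken from the published tools.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    """One data record: a value recorded for a patient under some term."""
    patient_id: str
    term: str
    value: str


def transform(records: Iterable[Record],
              mapping: Dict[str, str]) -> Tuple[List[Record], List[Record]]:
    """Rename local terms to central terms; collect records without a mapping."""
    harmonized: List[Record] = []
    unmapped: List[Record] = []
    for rec in records:
        target = mapping.get(rec.term)
        if target is None:
            unmapped.append(rec)  # left for expert review, not silently dropped
        else:
            harmonized.append(rec._replace(term=target))
    return harmonized, unmapped


# Illustrative data only (assumption): two local terms, one of them mapped.
mapping = {"Sex": "Gender"}
records = [Record("p1", "Sex", "female"), Record("p1", "Ward", "3B")]
harmonized, unmapped = transform(records, mapping)
```

Returning the unmapped records separately keeps the gap between automatic and expert-completed mappings visible instead of losing data during processing.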

Conclusion A generic harmonization approach was created and successfully used for cross-institutional data harmonization across 10 European biobanks. The software tools were made available as open source.

Protection of Human and Animal Subjects

The experiments were performed using anonymized patient data. The authors therefore declare that this study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects.


 