Pneumologie 2015; 69 - A64
DOI: 10.1055/s-0035-1556656

Data Harmonization and Data Integration Inside the Disease Area Lung Cancer of the German Center for Lung Research

D Firnkorn 1, M Ganzinger 1, T Muley 2, 3, T Michael 3, 4, P Knaup 1
  • 1 Institute of Medical Biometry and Informatics, Heidelberg University
  • 2 Translational Research Unit, Thoraxklinik at University Hospital Heidelberg
  • 3 Translational Lung Research Centre Heidelberg (TLRC-H), Member of the German Centre for Lung Research (DZL)
  • 4 Department of Oncology, Thoraxklinik at University Hospital Heidelberg

Introduction:

The disease area Lung Cancer (LuCa) of the German Center for Lung Research (DZL) aims to establish a shared case base for investigating lung-cancer-related cohorts. Because the source data elements (SDEs) differ between the participating clinics, a data harmonization process has to be performed to arrive at common data elements (CDEs) that provide semantic interoperability. We present the steps necessary to integrate data into a harmonized data structure for LuCa. This data structure and the underlying medical data are made available in a research data warehouse (RDW) to enable joint data analysis.

Methods:

Major parameter domains must be established before CDEs can be defined, so domain experts need to be involved in this process. For each predefined parameter domain, the matching SDEs have to be collected across the participating hospitals. A harmonization spreadsheet containing the SDEs of each site and the CDEs serves as the basis for the RDW. Talend Open Studio was used to extract, transform and load (ETL) the raw data according to the mapping rules defined in the spreadsheet. As the RDW, we use i2b2, which provides an intuitive query tool for cohort identification.
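The idea of a spreadsheet-driven mapping can be illustrated with a minimal Python sketch. It is not the Talend Open Studio job used in this work; the column layout of the harmonization table, the site name, the SDE and CDE names, and the value codes are all hypothetical and chosen only for illustration. Each row is treated as one mapping rule from a site-specific SDE value to a CDE value, and values without a rule are collected for review rather than loaded.

```python
import csv
import io

# Hypothetical harmonization table: one row per mapping rule.
# Column names and all values are illustrative assumptions.
HARMONIZATION_CSV = """site,sde,cde,source_value,target_value
heidelberg,geschlecht,sex,m,male
heidelberg,geschlecht,sex,w,female
heidelberg,raucherstatus,smoking_status,1,current
heidelberg,raucherstatus,smoking_status,0,never
"""


def load_rules(csv_text):
    """Index rules as (site, sde, source_value) -> (cde, target_value)."""
    rules = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["site"], row["sde"], row["source_value"])
        rules[key] = (row["cde"], row["target_value"])
    return rules


def to_facts(site, patient_id, record, rules):
    """Map one raw source record to (patient, CDE, value) facts.

    Returns the harmonized facts and the list of unmapped
    (sde, value) pairs so that missing rules can be reviewed.
    """
    facts, unmapped = [], []
    for sde, value in record.items():
        rule = rules.get((site, sde, str(value)))
        if rule is None:
            unmapped.append((sde, value))
        else:
            cde, target = rule
            facts.append((patient_id, cde, target))
    return facts, unmapped


rules = load_rules(HARMONIZATION_CSV)
raw_record = {"geschlecht": "w", "raucherstatus": 1, "histologie": "8140/3"}
facts, unmapped = to_facts("heidelberg", "P0001", raw_record, rules)
```

Keeping the rules in a table rather than in code means that a new site can be added by extending the spreadsheet, and the list of unmapped values shows which rules are still missing.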

Results:

We built a harmonized phenotype dataset for LuCa. We identified ten major parameter domains with 302 defined CDEs. We were able to implement 285 mapping rules for the SDEs from the Thoraxklinik Heidelberg. Our ETL process extracts raw data of 2967 pseudonymized patients with over 270,000 clinical facts. The spreadsheet serves as the basis for automatic configuration of the ETL process.

Discussion:

Currently, the local data managers in Großhansdorf and Munich are investigating their local data sources and listing them in the harmonization spreadsheet. Further efforts include implementing and connecting these data sources to achieve a common case base for LuCa within the general DZL RDW.

*Presenting author