CC BY-NC-ND 4.0 · Methods Inf Med 2019; 58(S 02): e72-e79
DOI: 10.1055/s-0039-3399579
Original Article
Georg Thieme Verlag KG Stuttgart · New York

One Step Away from Technology but One Step Towards Domain Experts—MDRBridge: A Template-Based ISO 11179-Compliant Metadata Processing Pipeline

Ann-Kristin Kock-Schoppenhauer (1), B. Kroll (1), M. Lambarki (2), H. Ulrich (1), S. Stahl-Toyota (3), J.K. Habermann (4, 5), P. Duhm-Harbeck (1), J. Ingenerf (6), M. Lablans (2)

1   IT Center for Clinical Research, University of Lübeck, Germany
2   Federated Information Systems, German Cancer Research Center, Heidelberg, Germany
3   Medical Informatics for Translational Oncology, German Cancer Research Center, Heidelberg, Germany
4   Section for Translational Surgical Oncology and Biobanking, Department of Surgery, University of Lübeck & University Clinical Center Schleswig-Holstein, Campus Lübeck, Germany
5   Interdisciplinary Center for Biobanking-Lübeck (ICB-L), University of Lübeck, Germany
6   Institute of Medical Informatics, University of Lübeck, Germany

Address for correspondence

Ann-Kristin Kock-Schoppenhauer, M.Sc.
IT Center for Clinical Research
Ratzeburger Allee 160, Lübeck 23562
Germany   

Publication History

19 November 2018

15 August 2019

Publication Date:
18 December 2019 (online)

 

Summary

Background Secondary use of routine medical data relies on a shared understanding of given information. This understanding is achieved through metadata and their interconnections, which can be stored in metadata repositories (MDRs). The necessity of an MDR is well understood, but the local work on metadata is a time-consuming and challenging process for domain experts.

Objective To support the identification, collection, and provision of metadata in a predefined structured manner to foster consolidation. A particular focus is placed on user acceptance.

Methods We propose the software pipeline MDRBridge as a practical intermediary for metadata capture and processing. It is based on MDRSheet, an ISO 11179-3-compliant template that uses popular spreadsheet software. Because metadata come from different origins, both manual entry and automatic extraction from application systems are supported. To enable the export of collected metadata into external MDRs, a mapping of ISO 11179 to the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM) was developed.

Results MDRSheet is embedded in the processing pipeline MDRBridge and delivers metadata in the CDISC ODM format for further use in MDRs. This approach is used to interactively unify core datasets, import existing standard datasets, and automatically extract all defined data elements from source systems. The involvement of clinical domain experts improved significantly due to minimal changes within their usual work routine.

Conclusion A high degree of acceptance was achieved by adapting the working methods of clinical domain experts. The designed process is capable of transforming all relevant data elements according to the ISO 11179-3 format. MDRSheet is used as an intermediate format to present the information at a glance and to allow editing or supplementing by domain experts.



Introduction

Clinical research is highly dependent on the secondary use of electronic health record data, often supplemented by research data.[1] It is commonly understood that well-defined, unambiguous common data elements are a necessity for reproducible results.[2]

Because the same content can be expressed in many different ways (e.g., “sex: {m, f}” vs. “gender: {m, f, unknown}” or “weight: real number [kg]” vs. “weight: {underweight, normal weight, overweight}”), technical tools such as metadata repositories (MDRs) are used for maintaining, exchanging, reviewing, and querying common data elements, e.g., to support cross-institutional pooling of data.[3] Advanced functionalities such as comparing, matching, and mapping data elements enable the transformation of instance data.[4] Metadata can be supplied by clinical applications, selected within specific projects, or defined by domain experts.

Reaching consensus on a common dataset can be an ambitious task, in particular in large joint research infrastructures that integrate data across many sites or if the progress of the project is reliant on domain experts with a tight time budget, such as senior clinicians who conduct research in addition to their clinical routine. Their experience is crucial for obtaining consensus-driven dataset definitions. Therefore, barriers for the experts' contribution should be kept as low as possible. When these domain experts are obliged to use an unfamiliar, even if well-developed, application, their motivation to contribute is reduced. Instead, the process should accommodate their preferred working methods. We concur with Ngouongo et al that “[t]he definition of item collections could be simplified by a structured template appropriate for empirical medical research.”[5]

Experiences in practice show that for the management of data collections, in particular in the clinical context, spreadsheet applications such as Microsoft Excel are popular choices.[6] [7] Excel is not only used for instance data management but also for the management of schema-level and administrative data. For example, the widely used open source clinical trial software OpenClinica for electronic data capture offers an Excel template for creating electronic case report forms.[8] Against this background, a practice-approved tool is presented that enables and improves the provision of quality-assured metadata to enable cross-institutional integration of clinical and research data.



Objectives

This work aims to provide a process that lets domain experts create, update, and agree on datasets with minimal effort, accommodating their circumstances and preferred working methods. A further aim is to make it possible to incorporate existing datasets, such as those from primary systems or core datasets. The developed tools and processes should make these datasets machine-readable and thus usable by components of research networks. As much as possible should be automated to minimize effort and susceptibility to errors.



Methods

We propose MDRSheet, a template that lets humans represent, edit, and visualize metadata and that also supports automatic processing of metadata; it is modeled on ISO 11179 part 3. The template is embedded in a process pipeline called MDRBridge, which accepts various sources of metadata as input and supports export to various MDRs.

MDRSheet—An ISO 11179-Based Spreadsheet

ISO/IEC 11179 describes a method of standardization and registration of metadata to generate a general understanding and to enable data exchange between projects or systems.[9] ISO/IEC 11179 part 3 “Registry metamodel and basic attributes” is deemed the most relevant part to our work. The model can be subdivided into three parts (see [Fig. 1]): the conceptual layer, the representational layer, and the identifying layer.

Fig. 1 Simplified core model of ISO 11179-3, divided into the conceptual, representational, and identifying layers. In this approach, we look at the identifying layer and the representational layer, which are highlighted in gray in the graphic.

The ISO/IEC 11179–3 metamodel and its essential attributes are used in several MDR implementations like the Cancer Data Standards Registry and Repository (caDSR),[10] the Australian Metadata Online Registry (METeOR),[11] or the German Samply.MDR.[12] Due to its wide distribution and already existing implementations, it was decided to use the ISO standard as a basis for modeling the template.

MDRSheet was designed as the key component that enables domain experts to analyze and consolidate metadata. It is a spreadsheet based on the ISO 11179 part 3 core model,[9] from which we selected the identifying and the representational layers. As our focus is on practical application in research networks, the conceptual domain is not yet considered (cf. [Fig. 2]). Relevant objects of the model are represented as tabs in the MDRSheet (see [Fig. 3] for details); their attributes correspond to columns within the tabs. We defined Namespace, Data Element, and Value Domain of enumerations as tabs in the MDRSheet. As slots are linked to Data Elements, they are integrated into the Data Elements tab. Every Data Element is described by its designation and a definition, either in English or in other languages. The derivation of Data Elements can be organized in an additional tab named Complex Data Elements. As the tabs follow a flat hierarchy whereas the core model is highly nested, additional foreign keys must be added. MDRSheet defines several common data types: String, Integer, Float, Enumeration, Boolean, Date, and Date Time. Some of these types require specific attributes such as regular expressions, Date Time formats, or norm values.
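To make the tab-and-column layout more concrete, the following is a minimal sketch of how one row of the Data Elements tab could be modeled and checked. All field names, the spelling of the type names, and the `problems` helper are hypothetical illustrations and are not taken from the actual MDRSheet template.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Data types named in the text; the exact spellings are an assumption.
DATA_TYPES = {"String", "Integer", "Float", "Enumeration", "Boolean", "Date", "DateTime"}


@dataclass
class DataElement:
    """One row of a hypothetical Data Elements tab."""
    element_id: str                      # key that other tabs can point to
    namespace: str                       # foreign key into the Namespace tab
    designation: str
    definition: str
    data_type: str
    regex: Optional[str] = None          # type-specific attribute
    value_domain: Optional[str] = None   # foreign key into the Value Domain tab
    slots: dict = field(default_factory=dict)  # slots are kept on this tab

    def problems(self, value_domains: dict) -> list:
        """Return readable problems; an empty list means the row is consistent."""
        found = []
        if self.data_type not in DATA_TYPES:
            found.append(f"unknown data type: {self.data_type}")
        if self.data_type == "Enumeration" and self.value_domain not in value_domains:
            found.append("enumeration without a matching value domain")
        if self.regex is not None:
            try:
                re.compile(self.regex)
            except re.error:
                found.append("regular expression does not compile")
        return found


# Value Domain tab, keyed by the foreign key used in DataElement.value_domain.
value_domains = {"vd_sex": ["m", "f", "unknown"]}

sex = DataElement("de_1", "demo", "Sex", "Administrative sex", "Enumeration",
                  value_domain="vd_sex")
weight = DataElement("de_2", "demo", "Weight", "Body weight in kg", "Decimal")
```

Here `sex.problems(value_domains)` returns an empty list, while the `weight` row is flagged because `Decimal` is not one of the listed types. The explicit `namespace` and `value_domain` keys stand in for the additional foreign keys that a flat tab layout needs.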

Fig. 2 Simplified mapping of ISO 11179, MDRSheet, CDISC ODM, and Samply.MDR. Values on the same level were identified as related.
Fig. 3 A section of MDRSheet; it shows the data elements tab with the attributes: name, description, group membership, data type, lists, and regular expressions. The lower part shows a schematic representation of the different tabs from the MDRSheet.

Within a feedback loop, the designed MDRSheet was then verified in two ways: first, experts were interviewed so that the MDRSheet could be fitted to their needs in exchanging and discussing metadata, and second, an MDR satisfying the ISO 11179 standard was analyzed. Both evaluations confirmed the design of the MDRSheet, and the remarks could be harmonized. The audit showed that it might be necessary to adapt the MDRSheet to project-specific needs. Therefore, an overview tab was added that presents general information about the MDRSheet, such as its version, title, and general explanations. Header names of all columns can be customized, and the MDRSheet may be extended by extra columns according to project-specific requirements, such as the origin of the metadata. Customization (e.g., of headers) encourages projects to use their own vocabulary and thus helps stakeholders work with the MDRSheet.



MDRBridge—Generating the Extract, Transform, Load Process

A prototype of MDRBridge was implemented as a pipeline system that uses MDRSheet as its mediation format. The transfer of metadata into an MDR instance consists of three steps:

  • The relevant metadata need to be identified and entered manually or extracted automatically from suitable source systems.

  • All relevant metadata are collected in MDRSheet and verified by domain experts.

  • Metadata are transformed into a suitable exchange format for MDRs like ISO 11179-based Samply.MDR or by means of Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM); see [Fig. 4].

Fig. 4 The MDRSheet-based pipeline for processing metadata. The data sources for metadata are either extracted automatically or entered directly into the MDRSheet. MDRSheet itself then provides data verification by the domain experts. Subsequently, a transformation to CDISC ODM takes place to be able to process the metadata further in accordance with the standard, e.g., into the Samply.MDR or other MDR implementations.
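The three steps listed above can be sketched as a small chain of functions. This is only an illustration of the control flow under simplifying assumptions: the function names, the row layout, and the `approve` callback that stands in for the human review are hypothetical and do not reflect the actual Talend jobs.

```python
from typing import Callable, Iterable

Row = dict  # one MDRSheet row, e.g. {"name": ..., "data_type": ...}


def collect(extracted: Iterable[Row], manual: Iterable[Row]) -> list:
    """Step 1: gather rows from automatic extraction and manual entry."""
    return list(extracted) + list(manual)


def verify(rows: list, approve: Callable[[Row], bool]) -> list:
    """Step 2: keep only rows that the domain expert has approved."""
    return [row for row in rows if approve(row)]


def transform(rows: list, target: str) -> dict:
    """Step 3: wrap the verified rows for a named export format."""
    if target not in {"ODM", "Samply.MDR"}:
        raise ValueError(f"unsupported target: {target}")
    return {"format": target, "items": rows}


def run_pipeline(extracted, manual, approve, target="ODM"):
    return transform(verify(collect(extracted, manual), approve), target)


result = run_pipeline(
    extracted=[{"name": "blood group", "data_type": "Enumeration"}],
    manual=[{"name": "weight", "data_type": "Float"},
            {"name": "", "data_type": "String"}],
    approve=lambda row: bool(row["name"]),  # stand-in for the expert review
)
```

In this sketch the unnamed row is dropped during verification, so `result["items"]` holds the two named elements. Keeping the review as a separate step between collection and transformation mirrors the role of MDRSheet as the point where domain experts intervene.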

Input Formats

We identified three input sources, each of which needs a consistent way of collecting metadata for review within the MDRSheet. Following the bottom-up approach, one technique is manual input by domain experts. Filling the MDRSheet by inserting predefined datasets corresponds instead to the top-down approach. Source systems such as data warehouses or clinical applications with predefined metadata relate to both approaches, as they can contain core datasets as well as a broad variety of metadata; the MDRSheet is filled by extracting metadata from such systems. The prototypical implementation supports both automatic and manual population of the MDRSheet. Database exports of locally used source systems can be supplemented by manual changes. Alternatively, metadata are entered directly into the MDRSheet.

Research Data Warehouse by the Example of CentraXX

CentraXX is a biobank and study management system[13] that has been an integral part of the German research landscape for some years. Currently, there are installations at 28 university hospitals in Germany.[14]

As a first step, all relevant items need to be identified and queried from the database: 24 tables providing master data (e.g., ischemia, blood group, country) were identified. These tables had to be combined with tables that hold the names and descriptions of the master data entries. Second, forms can also be created directly by a user in CentraXX. These forms are saved according to the Entity–Attribute–Value (EAV) schema. The tables are queried to match the structure of the MDRSheet, so that the result is an MDRSheet containing all metadata defined within the system. This data export is then enhanced by elements that are not present in the database but are inherent to the graphical user interface of CentraXX; nearly 200 such data elements were extracted manually. The process was executed using an MS SQL database, Windows as the operating system, and CentraXX.
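Querying an EAV table so that each entity becomes one sheet row amounts to a pivot. The sketch below illustrates the idea with an in-memory SQLite database instead of MS SQL; the table `form_item`, its columns, and the sample values are invented for illustration and do not describe the real CentraXX schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical EAV table: one row per (entity, attribute, value) triple.
    CREATE TABLE form_item (entity TEXT, attribute TEXT, value TEXT);
    INSERT INTO form_item VALUES
        ('item_1', 'name', 'Ischemia time'),
        ('item_1', 'data_type', 'Integer'),
        ('item_2', 'name', 'Blood group'),
        ('item_2', 'data_type', 'Enumeration');
""")

# Pivot the attributes into columns so that each entity yields one sheet row.
query = """
    SELECT entity,
           MAX(CASE WHEN attribute = 'name' THEN value END)      AS name,
           MAX(CASE WHEN attribute = 'data_type' THEN value END) AS data_type
    FROM form_item
    GROUP BY entity
    ORDER BY entity
"""
sheet_rows = [
    {"id": entity, "name": name, "data_type": data_type}
    for entity, name, data_type in conn.execute(query)
]
conn.close()
```

The conditional aggregation used here is standard SQL, so the same pattern should carry over to other database systems; attributes that are missing for an entity simply come back as empty cells for the domain experts to fill in.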



Manual Acquisition by Domain Experts and Predefined Datasets

Manual input can be required when no previously defined dataset is available or when system-independent metadata are required. Within the prototypical implementation, the MIABIS[15] and the Einheitlicher onkologischer Basisdatensatz ADT/GEKID[16] datasets were entered manually into the MDRSheet.



Transformation Using MDRSheet

To reuse or repurpose the collected metadata, to make them available in projects, or simply to maintain them in one single repository, it is necessary to transform the metadata from the MDRSheet into a suitable format that can be imported by MDRs. To keep the MDRSheet interoperable for data exchange between a broader range of clinical data software systems, a converter for transformation was developed. In this implementation, we support the CDISC ODM format and the Samply.MDR format.

CDISC ODM

The ODM from CDISC is a standard for representing clinical trial data that supports data and metadata interchange between heterogeneous systems.[17] The current version 1.3.2 of ODM, including the metadata component, was analyzed, and a mapping was created by matching the parameters and data types of the MDRSheet, and thus of ISO 11179, to those of the ODM standard. A mapping between the main elements of the MDRSheet and ODM is feasible,[3] and the results are presented in [Fig. 2].

The necessary transformation from the MDRSheet to ODM is implemented in the MDRBridge pipeline. To this end, the MDRSheet is first transformed to XML and then converted to the CDISC ODM format by an XSL transformation, which yields a standard representation.
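As an illustration of the target structure, the following sketch builds an ODM-style metadata fragment directly with Python's standard XML library rather than with an XSL transformation as in MDRBridge. The type mapping in `ODM_TYPES`, the OID scheme, and the row layout are assumptions made for this example; the mapping actually used is the one summarized in [Fig. 2]. The fragment omits the ODM namespace and the enclosing `ODM` and `Study` elements, so it would not validate against the ODM schema on its own.

```python
import xml.etree.ElementTree as ET

# Assumed type mapping for illustration only.
ODM_TYPES = {"String": "text", "Integer": "integer", "Float": "float",
             "Boolean": "boolean", "Date": "date", "DateTime": "datetime",
             "Enumeration": "text"}


def to_odm_fragment(elements, value_domains):
    """Build a MetaDataVersion fragment with ItemDef and CodeList children."""
    mdv = ET.Element("MetaDataVersion", OID="MDV.1", Name="MDRSheet export")
    for el in elements:
        item = ET.SubElement(mdv, "ItemDef", OID=f"I.{el['id']}",
                             Name=el["name"],
                             DataType=ODM_TYPES[el["data_type"]])
        if el["data_type"] == "Enumeration":
            # Enumerations point to a code list holding the permitted values.
            ET.SubElement(item, "CodeListRef",
                          CodeListOID=f"CL.{el['value_domain']}")
    for vd_id, values in value_domains.items():
        code_list = ET.SubElement(mdv, "CodeList", OID=f"CL.{vd_id}",
                                  Name=vd_id, DataType="text")
        for value in values:
            cl_item = ET.SubElement(code_list, "CodeListItem", CodedValue=value)
            decode = ET.SubElement(cl_item, "Decode")
            ET.SubElement(decode, "TranslatedText").text = value
    return mdv


fragment = to_odm_fragment(
    [{"id": "1", "name": "Sex", "data_type": "Enumeration", "value_domain": "sex"},
     {"id": "2", "name": "Weight", "data_type": "Float"}],
    {"sex": ["m", "f"]},
)
xml_text = ET.tostring(fragment, encoding="unicode")
```

Each Data Element becomes an `ItemDef`, and each enumerated Value Domain becomes a `CodeList` that the item references, which is the basic correspondence an MDRSheet-to-ODM conversion relies on.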



MDR Export, Demonstrated by Using the Samply.MDR

As the ISO standard does not offer any implementation of its own, ODM can only be one possible exchange format. Using an ISO 11179-compliant MDR as an example, we show how the bridge can be established. The Samply.MDR[12] implemented at the University of Mainz is widely used in the field of health care projects, amongst others within the German Cancer Consortium (DKTK),[18] [19] Open-Source-Registersystem für Seltene Erkrankungen (OSSE),[20] the European BBMRI-ERIC project,[21] MI consortium MIRACUM,[22] and within the German Biobank Alliance (GBA).[23] The Samply.MDR provides interfaces for importing metadata by using an XML import format. As the mapping of ODM to the Samply.MDR format has already been shown in previous work,[24] the transformation of ODM to Samply.MDR is integrated within the MDRBridge by the standard procedure of defining an XSL transformation.



Quality Assurance and Technical Validation

Since the primary task of the MDRBridge is to collect, acquire, and present existing metadata, we consider data-quality assurance as a postprocessing task following the import into an MDR. As a result, MDRBridge explicitly allows possible redundancies because this reflects the reality of the bottom-up approach. However, MDRBridge is able to consider data-quality criteria as part of a more technical and less content-related validation. This technical validation is an integral part of the pipeline, i.e., basic quality criteria such as completeness, uniformity, and uniqueness are checked and detected errors are marked. The export format of the MDRSheet is validated against either the ODM schema or the Samply.MDR schema. Every invalid record is logged to an extra file for error detection. This validated data collection can then be used by domain experts for a review process; the results of which can be used, for example, to incrementally improve the source data quality. The validation happens prior to the import so that only valid data can enter the standard format. For validating the conversion process, a set of test cases covering all parameters and data types available in the MDRSheet was built and executed.
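A technical validation of this kind could look roughly as follows. The sketch checks completeness, uniformity, and uniqueness and collects rejected rows for a separate error log. The concrete rules are assumptions: the text does not spell out how each criterion is defined, so uniqueness is interpreted here as unique identifiers and uniformity as agreed spellings of the data type names.

```python
ALLOWED_TYPES = {"String", "Integer", "Float", "Enumeration",
                 "Boolean", "Date", "DateTime"}
REQUIRED = ("id", "name", "data_type")


def validate(rows):
    """Split rows into valid ones and (row, reasons) pairs for the error log."""
    valid, rejected, seen = [], [], set()
    for row in rows:
        reasons = []
        # Completeness: every required field is present and non-empty.
        reasons += [f"missing {key}" for key in REQUIRED if not row.get(key)]
        # Uniformity: the data type uses one of the agreed spellings.
        if row.get("data_type") and row["data_type"] not in ALLOWED_TYPES:
            reasons.append(f"unknown data type {row['data_type']}")
        # Uniqueness: identifiers must not repeat.
        if row.get("id") in seen:
            reasons.append(f"duplicate id {row['id']}")
        if row.get("id"):
            seen.add(row["id"])
        if reasons:
            rejected.append((row, reasons))
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"id": "de_1", "name": "Sex", "data_type": "Enumeration"},
    {"id": "de_1", "name": "Gender", "data_type": "Enumeration"},
    {"id": "de_2", "name": "", "data_type": "float"},
]
valid, rejected = validate(rows)
error_log = [f"{row.get('id')}: {'; '.join(reasons)}" for row, reasons in rejected]
```

With the sample rows, only the first one passes; the second is logged as a duplicate identifier and the third for a missing name and a non-uniform type spelling. Content-level redundancy, such as "Sex" and "Gender" describing the same concept, is deliberately not checked, in line with leaving content-related quality assurance to the MDR.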



Results

Importing metadata into an MDR consists of several steps and human verification by domain experts. This process of extraction and/or collection of metadata, transforming, and loading them into an MDR requires a maximum of automation to reduce effort and transformation errors. The implementation of the MDRBridge was mainly realized using Talend Open Studio. Overall, we implemented three jobs written in Java:

  • Talend Job to extract metadata from source systems.

  • Talend Job to transform the metadata into a suitable format.

  • Java program to validate the XSLT input data.

Additionally, the jobs were combined in Java programs and can be controlled separately via graphical user interfaces; see [Fig. 5]. Hence, a fully automated pipeline has been established for transforming metadata from a source system into an MDR instance.

Fig. 5 The graphical user interface for the associated ETL process of the MDRBridge. The paths to the different source files can be entered, the CentraXX installation parameterized and the result format adapted. ETL, extract, transform, load.

Utilization of MDRSheet with or without MDRBridge

MDRSheet has been used successfully within GBA and BBMRI-ERIC and is applied in the National Network Genomic Medicine (nNGM). Various standard datasets could be made available within an MDR, and core datasets could be refined within the nNGM using the template. MDRBridge has been deployed successfully at the University Medical Center Schleswig-Holstein, Campus Lübeck, and at the German Cancer Research Center, Heidelberg. Moreover, the resulting Java jobs can be applied at other research sites that use CentraXX.



Discussion

Secondary use of data depends on data transformation. To define suitable transformation rules, it is necessary to collect the eligible data elements, to understand their meaning and context, and to find corresponding data elements. Recent research projects dealing with cross-site data sharing follow the top-down approach by defining a minimal or core dataset,[23] [25] storing the dataset in an MDR, and implementing an on-site mapping at each participating partner. With this approach, the potential of diversity and complexity is a priori not fully exploited. In this work the bottom-up approach is favored, where all data elements of the source systems are entered into an MDR and harmonization can be applied in a centralized manner. In our experience, the need to share instance data without first agreeing on a common dataset exists and will gain importance due to ongoing digitalization. Our approach supports this process by bringing in domain knowledge and enabling researchers, and it motivates MDR engineers to develop sophisticated methods that realize the exchange of instance data through the exchange of metadata. The extraction of all defined data elements from clinical research warehouses and the exemplary utilization of the implementation constitute a proof of concept. Nonetheless, both bottom-up and top-down variations of metadata provision can be supported.

MDRSheet is especially suitable for input by domain experts, as it has been specially adapted to their needs. It can serve as a template and is presented in a familiar way, in which they can insert the data elements they deem essential and further use them in collaboration platforms of their choice.[26] Having all metadata visible on one sheet is an important factor, as the field of clinical research is often complex. When it comes to intellectual property in the case of metadata sovereignty and data protection, the spreadsheet conveys a sense of control.[27] [28] [29] An export from such clinical research platforms is always a dedicated act and needs to be approved even if it does not concern patient data. For data-quality reasons, it may need to be manually supervised. As exports sometimes refer to a project-specific dataset and not all available data elements are necessary, the MDRSheet can at this point in the process be modified, exchanged, or agreed upon to the degree needed.

We do not claim that spreadsheet solutions like Microsoft Excel are the best way to reach this consensus. In fact, specialized tools like the Clinical Knowledge Manager,[30] Dugas et al's Portal of Medical Models,[29] or ArtDecor[31] are clearly superior in terms of features, offering native support for change management, versioning, and medical coding systems. We do observe, however, that using a ubiquitous piece of software like Excel successfully mobilizes domain experts in the clinical domain who could not be convinced to use other tools. It is a recurring observation that Excel, despite its flaws, is successfully used in different projects.[7] [8] [32] [33] [34] This type of data storage offers many advantages, among them the good structured data acquisition properties of Excel, its high market penetration, the convincing degree of freedom in electronic data capture, and the reusability of the data.[35] Further reasons for such a solution are that spreadsheets are entirely self-contained and do not require any further installation, training, or configuration, so the hurdles are very low, and that the representation enables immediate data assimilation and manipulation.[36]

Prior research shows that “ODM contains just enough information to unambiguously match the requirements of the ISO 11179, part 3 core data model.”[5] The transformation of metadata into ODM is an important step toward interoperable health data exchange. Another medium-term advantage is that MDR users can use established electronic data capture systems for data collection and data management, as well as standard instruments. In MetaRep (ISO 21562),[37] an extension and clarification of ISO/IEC 11179 is under development to meet the requirements of health care. Our approach corresponds to the motivation of this standard: “While […] in health care we anticipate many more smaller registries detailing local implementations and standards selected from those that are available and thus the decision was made to specify which items were identified, classified, named, and administered to give implementers certainty when coping with content supplied by other metadata registries.” (MetaRep ISO 21562, Annexe A, p. 49). The potential for use of the prototype MDRBridge is high: as mentioned, CentraXX plays a significant role in the German clinical research landscape, and so does the Samply.MDR as a target system.

On the one hand, research projects are confronted with strong time restrictions regarding the availability of physicians as domain experts. On the other hand, there is an increasing demand for harmonization and quality assurance of data with respect to requirements like pooling and reproducing of research results.

Due to high dynamics in various research projects with existing local terminology dialects and a lack of governance, the majority of existing MDR systems are used by single projects, providing relevant variables in a bottom-up approach, linking local dialects to national or international standards, and harmonizing data elements between project partners. In contrast, top-down approaches to metadata lead to the idea of core datasets with common data elements that are agreed centrally and rely on collaborative platforms and suitable governance.[3] [38] Both strategies, top-down and bottom-up, have their justification. Working with metadata, however, depends critically on uniform technical access to MDR content and existing MDR systems. A lot of effort is invested in the features of MDRs, whereas the collection of metadata receives little attention. Providing metadata is a task that demands a lot of domain knowledge and is typically performed in the clinical field by data stewards, knowledge engineers, physicians, or other medical staff. Collecting and coordinating entries is a time-consuming and challenging process.



Conclusion

The presented template MDRSheet takes one step away from technology and one step toward domain experts, who can contribute their invaluable knowledge with familiar tools. Provision of, agreement on, and collaboration around metadata could be realized by achieving high user acceptance, which resulted in particular from adapting the solution to the experts' working methods. The presented pipeline bridges the gap between metadata and domain experts. The transfer of metadata into an MDR is a time-consuming task; by applying the described MDRBridge, we shortened the required time substantially. The coordination phases can, therefore, be more targeted and faster. Other research sites using CentraXX and Samply.MDR may benefit from this work as well. It can prepare the MDR for upcoming applications and is a beneficial addition to the Samply community.



Conflict of Interest

None declared.

Acknowledgments

None.

