Evolution of a Graph Model for the OMOP Common Data Model

Mengjia Kang; Jose A. Alvarado-Guzman; Luke V. Rasmussen; Justin B. Starren

doi:10.1055/s-0044-1791487


Appl Clin Inform 2024; 15(05): 1056-1065
DOI: 10.1055/s-0044-1791487

Research Article

Authors

Mengjia Kang¹; Jose A. Alvarado-Guzman²; Luke V. Rasmussen³; Justin B. Starren³,⁴

¹Division of Pulmonary and Critical Care Medicine, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, United States
²Neo4j, Inc., San Mateo, California, United States
³Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States
⁴University of Arizona Health Sciences, Tucson, Arizona, United States

Funding This work was supported by grant 5U19AI135964 from the National Institute of Allergy and Infectious Diseases of the National Institutes of Health.

Abstract

Objective Graph databases for electronic health record (EHR) data have become a useful tool for clinical research in recent years, but there is a lack of published methods to transform relational databases to a graph database schema. We developed a graph model for the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) that can be reused across research institutions.

Methods We created and evaluated four models, representing two different strategies, for converting the standardized clinical and vocabulary tables of OMOP into a property graph model within the Neo4j graph database. Taking the Successful Clinical Response in Pneumonia Therapy (SCRIPT) and Collaborative Resource for Intensive care Translational science, Informatics, Comprehensive Analytics, and Learning (CRITICAL) cohorts as test datasets with different sizes, we compared two of the resulting graph models with respect to database performance including database building time, query complexity, and runtime for both cohorts.

Results Utilizing a graph schema that was optimized for storing critical information as topology rather than attributes resulted in a significant improvement in both data creation and querying. The graph database for our larger cohort, CRITICAL, can be built within 1 hour for 134,145 patients, with a total of 749,011,396 nodes and 1,703,560,910 edges.

Discussion To our knowledge, this is the first generalized solution to convert the OMOP CDM to a graph-optimized schema. Despite being developed for studies at a single institution, the modeling method can be applied to other OMOP CDM v5.x databases. Our evaluation with the SCRIPT and CRITICAL cohorts and comparison between the current and previous versions show advantages in code simplicity, database building, and query speed.

Conclusion We developed a method for converting OMOP CDM databases into graph databases. Our experiments revealed that the final model outperformed the initial relational-to-graph transformation in both code simplicity and query efficiency, particularly for complex queries.

Keywords

databases - general information systems and technologies in clinical settings - OMOP common data model - clinical data management - electronic health records and systems - clinical information systems

Background and Significance

Systems biology research often depends on combining multiple data types, ranging from clinical variables in the electronic health record (EHR) to multi-omics data sets. Modeling efforts within systems biology depend on the ability to represent and process the multitude of relationships among these concepts. While significant research has been done using traditional relational database management systems (RDBMS), graph databases have emerged as a promising technology for enabling and optimizing certain analyses, including graph algorithms such as centrality, community detection, path finding, and node embeddings.[1] [2] Leveraging graph algorithms alongside other nongraph approaches provides new opportunities to gain novel insights into biological processes.

Graph databases have been used successfully with biological data sources.[3] Queries against the Reactome[4] graph database required 93% less time than those against its RDBMS counterpart. However, clinical data sources have not received as much attention in the graph database community. Some notable examples include claims data, medications, and disease interactions,[5] [6] but not a full integration of all clinical data elements (labs, diagnoses, medications, visits, and procedures). One of the challenges is that EHR data are not stored natively in a graph format, and data warehouses for clinical operations and research typically transform the transactional database schema into an RDBMS schema optimized for analytics.[7] Furthermore, modeling data for use in a graph database requires careful preparation. While a naïve row-to-node conversion is possible (each row is a node, each column is an attribute, and each foreign key is an edge), the resulting graph is typically attribute heavy, resulting in suboptimal performance. This is because graph database engines are typically optimized to query knowledge that is represented in the topology of the graph, rather than in the attributes.[5] [8]
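To make the naïve row-to-node rule concrete, the following minimal Python sketch converts relational rows into nodes and edges. The table contents, the `FOREIGN_KEYS` mapping, and the `HAS_<TABLE>` edge naming are illustrative assumptions and are not the schema used in this study.

```python
# Naive row-to-node conversion: each row becomes a node, each non-key column
# becomes a node attribute, and each foreign key becomes an edge.
# All table contents and names below are made-up examples.

# Hypothetical mapping: table -> (primary key column, {foreign key column: target table})
FOREIGN_KEYS = {
    "person": ("person_id", {}),
    "drug_exposure": ("drug_exposure_id", {"person_id": "person"}),
}


def rows_to_graph(tables):
    """Return (nodes, edges) built from a dict of table name -> list of row dicts."""
    nodes, edges = {}, []
    for table, rows in tables.items():
        pk, fks = FOREIGN_KEYS[table]
        for row in rows:
            node_id = (table, row[pk])
            # Columns that are neither the primary key nor a foreign key stay as attributes.
            nodes[node_id] = {k: v for k, v in row.items() if k != pk and k not in fks}
            for fk_col, target in fks.items():
                # Edge runs from the referenced row to the referencing row.
                edges.append(((target, row[fk_col]), f"HAS_{table.upper()}", node_id))
    return nodes, edges


tables = {
    "person": [{"person_id": 1, "year_of_birth": 1950}],
    "drug_exposure": [
        {"drug_exposure_id": 10, "person_id": 1, "drug_name": "dexamethasone"},
        {"drug_exposure_id": 11, "person_id": 1, "drug_name": "furosemide"},
    ],
}
nodes, edges = rows_to_graph(tables)
```

In the resulting graph the drug name lives only as an attribute on each exposure node, so a question such as "which drugs occur together" has to compare attribute values instead of following edges, which is the attribute-heavy pattern described above.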

Within the realm of biomedical research using EHR data, there has been a trend toward the use of common data models (CDMs). CDMs allow different organizations running different EHRs, or the same EHR configured differently, to share a common structure and semantics for how their EHR data are represented. Although it requires more work upfront to transform the EHR data into the CDM, over time it supports broader portability of work.[9] Among the CDMs, the Observational Medical Outcomes Partnership (OMOP) CDM[10] has emerged as the preferred choice for many national initiatives.[11] [12]

Previous efforts using graph databases have primarily focused on local or bespoke data models, with some preliminary work focused on a CDM approach.[13] For graph databases to be more accessible for biomedical research, developing an approach to transform a popular CDM into a graph structure would facilitate their adoption. While there have been multiple studies evaluating methods for converting and harmonizing various types of EHR data into the OMOP relational schema,[14] [15] [16] [17] [18] there has been relatively little work evaluating the conversion from OMOP to other schemas,[19] [20] especially in the case of graph schemas.

Objective

Our objective is to develop a conversion of OMOP CDM data into a graph schema that optimally leverages the unique capabilities of graph database engines. Toward this goal, we developed and evaluated a series of approaches for converting OMOP CDM data into a widely used graph database.[21] We sought to make a generalizable graph model that could be applied to a variety of OMOP instances.

Materials and Methods

Study Cohort

This work was conducted as part of the Successful Clinical Response in Pneumonia Therapy (SCRIPT) study,[22] a multiyear systems biology study integrating clinical, transcriptomic, metagenomic, and bacterial genomic data to support machine learning on host-pathogen interaction and pneumonia episode outcomes.[23] The SCRIPT cohort was recruited and consented at the Northwestern Memorial Hospital (NMH) and included 590 participants as of March 2022. To scale up the evaluation of the graph model, we conducted identical tests on the Collaborative Resource for Intensive care Translational science, Informatics, Comprehensive Analytics, and Learning (CRITICAL) database, comprising 134,145 patients admitted to the intensive care unit (ICU) at NMH between January 1, 2002 and December 31, 2021. This work was reviewed and approved by the Northwestern University Institutional Review Board.

Graph Database Platform Selection

To select a graph database system, we compared several NoSQL databases, primarily focusing on Azure Cosmos DB[24] and Neo4j.[25] Cosmos DB is a scalable, multimodel NoSQL database developed by Microsoft, and Neo4j is an open-source graph database with numerous native graph analytic capabilities. Cosmos DB is a cloud-based solution charging by usage, although institutions may be able to get free credits to use it. Neo4j offers a free community version for download, allowing broader dissemination of this work. Our team found the Cypher query language used by Neo4j to be intuitive and well supported; also, Neo4j's database performance met our research needs. Cosmos DB lacked visualization features overall, and it uses the Gremlin API, which our team felt had a steeper learning curve and, being newer, less community support. Additionally, we evaluated other graph-based visualization tools like Cytoscape[26] and Gephi,[27] but found them lacking in the data storage and query features we needed.

Data Source

Data were prepared by the Northwestern Medicine Enterprise Data Warehouse (NMEDW)[28]—a joint effort between the Northwestern University Feinberg School of Medicine and Northwestern Memorial Healthcare Corporation. The NMEDW has the infrastructure to create OMOP tables from data marts populated by our primary EHR (Epic), as well as from ancillary and legacy systems.

SCRIPT Data Source

SCRIPT data from the NMEDW were available for 9 of the 15 v5.3 OMOP Standardized Clinical Data Tables: Person, Provider, Observation_Period, Visit_Occurrence, Condition_Occurrence, Drug_Exposure, Procedure_Occurrence, Measurement, and Observation. The remaining OMOP tables were not provided as they were not populated at that time by NMEDW, or were deemed irrelevant for the SCRIPT study. The database included all historical data from the EHR, not only the data collected during SCRIPT admission. The OMOP tables generated by the NMEDW contained an extra concept name column to facilitate human inspection, which was included in the graph version of each OMOP table. Data were provided as a limited data set as defined by the Health Insurance Portability and Accountability Act Privacy Regulations.

CRITICAL Data Source

In alignment with the SCRIPT dataset, we extracted an identical set of nine tables from the CRITICAL OMOP database. Since the CRITICAL OMOP database did not include the concept names in the event tables, we preprocessed the dataset to include the concept names and then imported it into the graph database. Similar to SCRIPT, this included all medical data for each individual in the cohort, not just data during ICU stays, and was also considered a limited data set.

Model Design

The mapping from OMOP CDM to the graph schema evolved over several iterations ([Table 1]).

Table 1
Evolution of graph models for OMOP CDM
Version	Description and changes	Rationale for update	Strengths	Weaknesses
Version 1	Naïve interpretation of relational table to node and “HAS_EVENT” as edges with Person node as the network center	N/A	Intuitive structure	Network is not easily traversable
Version 2	Added the ASSOCIATED_DURING_VISIT relationships with both Person and visit node as the center; updated demographics from node attributes to node	To directly link the encounter and event nodes	Easier network traversal than Version 1; smaller database size compared to later versions	Most information still stored as attributes
Version 3	Added the corresponding concept nodes for instances table	To allow the database to support semantics queries	Most frequently queried information and relationships moved to graph edges. Vocabulary-based queries enabled	Increased database size on disk compared to Version 2
Version 4	Transformed the node from unique entity to entity occurrences; moved edge attributes to node attributes; added provider nodes and edges	To encode the patient events network to the graph topology as much as possible	Simplified terminology management. More complex ontology-based queries enabled	Increased database size on disk compared to Version 3

Abbreviations: OMOP, Observational Medical Outcomes Partnership; CDM, common data model.

We started by reviewing the OMOP CDM ([Fig. 1]) relational database tables and created one spreadsheet per table to list the column names and metadata with SCRIPT-specific details. Three team members (M.K., L.V.R., and J.B.S.) independently reviewed each table to decide whether each column should be a node or edge attribute, and whether it should be a unique index. Decisions were then reviewed and discussed during weekly team meetings to reach consensus. In general, we included only the primary key and foreign keys in our graph schema, along with a few important properties such as concept name, start date, and end date. We used the arrows web app (http://www.apcjones.com/arrows/#) to record and visualize the schema versions. We started with a naïve graph model approach, transforming each table into a node type and columns in each table into node attributes. The schema is similar to the star schema used in relational databases, with the Person table in the center and the instance tables around it. The instance tables (Condition, Drug_Exposure, Measurement, Procedure, Observation) were directly connected to Person and had no connection to Visit_Occurrence. Anticipating that there would be frequent research questions regarding demographics, we represented Gender, Race, and Ethnicity as independent nodes instead of as node properties. We designated this as Version 1 ([Fig. 2]).

Fig. 1 Observational Medical Outcomes Partnership (OMOP) common data model (CDM) v5.3.1. Highlighted are the OMOP tables included in the graph modeling.

Fig. 2 Graph schema Version 1. Version 1 represented a naïve and direct translation of the Observational Medical Outcomes Partnership (OMOP) relational structure into a graph structure.

Iterating on Version 1, in Version 2 ([Fig. 3]) we recognized that encounter (Visit_Occurrence) nodes are usually critical to clinical data analysis. To support this, we added an ASSOCIATED_DURING_VISIT relationship from other instances nodes to the encounter nodes, making Person and Visit_Occurrence nodes the center of the network.

Fig. 3 Graph schema Version 2. Version 2 added computed links to both Person and Visit_Occurrence to improve query efficiency.
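The Version 2 change can be illustrated with a short Python sketch that derives ASSOCIATED_DURING_VISIT edges from event rows. It assumes the link is computed from the `visit_occurrence_id` foreign key that OMOP event tables carry; the `event_id` key and the sample rows are hypothetical.

```python
def link_events_to_visits(events):
    """Build (event_id, 'ASSOCIATED_DURING_VISIT', visit_id) edges.

    Events with no visit_occurrence_id get no visit edge and would remain
    reachable only through their Person node.
    """
    edges = []
    for event in events:
        visit_id = event.get("visit_occurrence_id")
        if visit_id is not None:
            edges.append((event["event_id"], "ASSOCIATED_DURING_VISIT", visit_id))
    return edges


# Made-up event rows: two events recorded during visit 7, one outside any visit.
events = [
    {"event_id": "drug-10", "person_id": 1, "visit_occurrence_id": 7},
    {"event_id": "cond-20", "person_id": 1, "visit_occurrence_id": 7},
    {"event_id": "meas-30", "person_id": 1, "visit_occurrence_id": None},
]
visit_edges = link_events_to_visits(events)
```

With these edges in place, all events of one encounter are a single hop from the Visit_Occurrence node, which is what makes visit-scoped questions traversable without attribute comparisons.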

For Version 3 ([Fig. 4]), we added OMOP vocabulary (specifically concepts) to the model. This meant the individual instance nodes no longer had to duplicate the concept information on every node. Although OMOP uses a central Concept table, we split the concepts by data type, creating Measurement_Concept, Observation_Concept, Visit_Concept, Procedure_Concept, Condition_Concept, and Drug_Concept nodes. Previous versions did not link the Provider node with the rest of the network. In this version, we added relationships between the instance nodes and Provider nodes. Specifically, we linked from Provider to Measurement, Observation_Occurrence, and Procedure_Occurrence nodes through a RESPONSIBLE_FOR edge, linked to Condition_Occurrence through a CAPTURED edge, and linked to Drug_Exposure through an INITIATED edge. The Provider nodes were then directly connected to instance nodes without the need to traverse from Visit_Occurrence nodes to other clinical events.

Fig. 4 Graph schema Version 3. Version 3 abstracted out concepts as edges rather than attributes.
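As a rough illustration of the Version 3 idea, the Python sketch below factors repeated concept columns out of instance rows into shared concept nodes. The `HAS_CONCEPT` edge type, the `drug_concept_name` column name, and the concept identifiers are placeholders chosen for this example, not names taken from the study's schema.

```python
def split_out_concepts(rows, id_col, name_col):
    """Split instance rows into shared concept nodes, slim rows, and linking edges."""
    concepts = {}   # concept_id -> concept name, stored once per concept
    slim_rows = []  # instance rows without the duplicated concept columns
    edges = []      # (instance row index, "HAS_CONCEPT", concept_id)
    for index, row in enumerate(rows):
        concepts.setdefault(row[id_col], row[name_col])
        slim_rows.append({k: v for k, v in row.items() if k not in (id_col, name_col)})
        edges.append((index, "HAS_CONCEPT", row[id_col]))
    return concepts, slim_rows, edges


# Made-up drug exposure rows; two of them share the same concept.
drug_rows = [
    {"person_id": 1, "drug_concept_id": 100, "drug_concept_name": "dexamethasone"},
    {"person_id": 2, "drug_concept_id": 100, "drug_concept_name": "dexamethasone"},
    {"person_id": 2, "drug_concept_id": 200, "drug_concept_name": "furosemide"},
]
concepts, slim_rows, concept_edges = split_out_concepts(
    drug_rows, "drug_concept_id", "drug_concept_name"
)
```

Every exposure of the same drug now points at one shared concept node, so "all exposures of concept X" becomes an edge traversal from that node.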

In Version 4 ([Fig. 5]), we revisited the decision to have multiple concept nodes, and aligned on a single Concept node that mimics the original OMOP design. This simplified the schema and database creation without affecting database performance. We also included the other OMOP standardized vocabulary tables (Vocabulary, Concept, Concept_Class) as nodes in our schema. To build the network between the vocabulary tables, we added two edges between each of the vocabulary nodes, allowing easy traversal from one vocabulary node to another and back. We also implemented self-directed relationships; the RELATED_TO edge captures the relationships in the concept_relationship table and can be used to describe semantic distances. (In this version, we chose a generalized relation, rather than the subtyped relations in the concept_relationship table.) The NEXT edge links each patient's visits in sequence and builds up the patient journey.

Fig. 5 Graph schema Version 4. Version 4 was designed to maximize the amount of information represented as graph topology and minimize the use of attributes. Note that for testing purposes, we only loaded the identical entities for Versions 1 and 4, which means that the vocabulary tables were not loaded during the performance test.
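A minimal Python sketch of how NEXT edges could be derived is shown below. It assumes visits are ordered per person by `visit_start_date`, with the visit identifier as a tie-breaker; the article does not state the exact ordering rule, and the sample visits are made up.

```python
from collections import defaultdict
from datetime import date


def build_next_edges(visits):
    """Chain each person's visits in chronological order with NEXT edges."""
    by_person = defaultdict(list)
    for visit in visits:
        by_person[visit["person_id"]].append(visit)

    edges = []
    for person_visits in by_person.values():
        ordered = sorted(
            person_visits,
            key=lambda v: (v["visit_start_date"], v["visit_occurrence_id"]),
        )
        # Pair every visit with the one that directly follows it.
        for earlier, later in zip(ordered, ordered[1:]):
            edges.append(
                (earlier["visit_occurrence_id"], "NEXT", later["visit_occurrence_id"])
            )
    return edges


visits = [
    {"person_id": 1, "visit_occurrence_id": 5, "visit_start_date": date(2020, 3, 1)},
    {"person_id": 1, "visit_occurrence_id": 3, "visit_start_date": date(2020, 1, 1)},
    {"person_id": 1, "visit_occurrence_id": 4, "visit_start_date": date(2020, 2, 1)},
    {"person_id": 2, "visit_occurrence_id": 9, "visit_start_date": date(2020, 1, 15)},
]
next_edges = build_next_edges(visits)
```

A person with a single visit contributes no NEXT edge, and following the chain from a person's first visit reproduces the visit sequence without sorting dates at query time.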

Model Comparison

We selected Versions 1 and 4 for performance comparison. These two versions were selected because they represented implementations of the two major strategies for relational to graph conversion, specifically naïve recapitulation of the relational knowledge structure (Version 1) versus graph optimized (Version 4). Comparison metrics included database load time, query runtime, and database size. For test purposes, we did not include provider and vocabulary tables because (1) Version 1 did not support separate vocabulary tables and (2) as a result, the test queries could not use ontology-based features. The test Neo4j database instance was set up on a server running an AMD EPYC 7452 32-Core Processor. We used Neo4j Community Edition v4.4.15 for our evaluation.

We created comma-separated values (CSV) extracts from our source OMOP relational database and used neo4j-admin import to populate the database. Following the import, we added indexes for the Person and instance nodes (Condition_Occurrence, Drug_Exposure, Measurement, Procedure_Occurrence, Observation, Visit_Occurrence) for both the Version 1 and Version 4 models. Note that the ATC4 node and DRUG_ATC4 edge were added to both Versions 1 and 4 at the testing stage, when we wanted to check the medication classes rather than individual types. We did not create any edge index, considering that the two versions have significantly different edges in both types and numbers. Two testing queries were developed based on actual questions posed by clinicians on the study team. The queries were built to exercise the graph model relationships.
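The CSV extracts consumed by neo4j-admin import use typed header fields such as `:ID`, `:LABEL`, `:START_ID`, `:END_ID`, and `:TYPE`. The Python sketch below writes such files in memory; the helper names, the chosen properties, and the sample rows are assumptions for illustration and are not taken from the study's repository.

```python
import csv
import io


def nodes_csv(label, id_col, rows, props):
    """Render node rows with a neo4j-admin import style header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # ":ID(<space>)" declares the node identifier within a named ID space.
    writer.writerow([f"{id_col}:ID({label})", *props, ":LABEL"])
    for row in rows:
        writer.writerow([row[id_col], *(row[p] for p in props), label])
    return buf.getvalue()


def rels_csv(rel_type, start_label, end_label, pairs):
    """Render relationship rows linking two ID spaces."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([f":START_ID({start_label})", f":END_ID({end_label})", ":TYPE"])
    for start_id, end_id in pairs:
        writer.writerow([start_id, end_id, rel_type])
    return buf.getvalue()


# Made-up example: one Person node and one HAS_EVENT edge to a drug exposure.
person_file = nodes_csv(
    "Person", "person_id", [{"person_id": 1, "year_of_birth": 1950}], ["year_of_birth"]
)
edge_file = rels_csv("HAS_EVENT", "Person", "Drug_Exposure", [(1, 10)])

# In Neo4j 4.x the files would then typically be passed to the import tool with
# --nodes=<file> and --relationships=<file> options (file names are up to the user).
```

Writing one node file per label and one relationship file per edge type keeps the extract step a set of plain SQL-to-CSV exports.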

We developed scripts to automate benchmarking multiple iterations of database loading. Database loading was run 30 times each for both models, with the database recreated for each iteration. Each of the two testing queries was executed five times manually on SCRIPT and CRITICAL databases with Versions 1 and 4.

Results

Following the described iterative process, we refined a graph database model for the OMOP CDM, resulting in four separate versions of the schema. Diagrams of the node and edge relationships for each version of the schema are shown in [Figs. 2] [3] [4] [5], and a summary of each version along with strengths and weaknesses identified during development is presented in [Table 1]. We have made available a public repository on GitHub (https://github.com/NUSCRIPT/OMOP_to_Graph) with code and instructions on how to build a Neo4j graph database for the OMOP CDM, and benchmarking scripts for database loading and query running time.

After loading the same OMOP CDM dataset of 590 SCRIPT patients into the graph database, Version 1 resulted in 16,690 nodes and 8,091,100 edges, while Version 4 had a total of 8,088,034 nodes and 17,011,820 edges. The mean database loading time was 17.4 (standard deviation [SD]: 0.9) seconds for Version 1 and 28.8 (SD: 2.4) seconds for Version 4, a statistically significant difference (p < 0.01). For the CRITICAL cohort with 134,145 patients, Version 1 resulted in 198,286 nodes and 1,498,570,382 edges, while Version 4 had a total of 749,011,396 nodes and 1,703,560,910 edges. The mean database loading time was 481.2 (SD: 74.9) seconds for Version 1 and 3,344.6 (SD: 468.0) seconds for Version 4 (also statistically significant at p < 0.01).
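The article does not state which statistical test produced these p-values. As a plausibility check only, the sketch below computes Welch's t statistic from the reported means and SDs, assuming 30 independent loading runs per model for each cohort (the Methods report 30 runs per model).

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Reported loading times in seconds; n = 30 per model is an assumption per cohort.
t_script, df_script = welch_t(17.4, 0.9, 30, 28.8, 2.4, 30)
t_critical, df_critical = welch_t(481.2, 74.9, 30, 3344.6, 468.0, 30)
```

Under these assumptions both t statistics come out far above the two-sided 1% critical value (roughly 2.7 at these degrees of freedom), which is consistent with the reported p < 0.01.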

As can be seen in [Table 2], the relative performance of the two versions depended on the nature of the query. For the test on the SCRIPT cohort, Question 1 (find patients with specific diagnosis, procedure, and drug prescription) was approximately nine times faster on Version 1 than on Version 4. In contrast, Question 2 (find the most co-prescribed drugs) was approximately 26-fold faster on Version 4 than on Version 1. As for the CRITICAL cohort, Question 1 was about 49 times faster on Version 1 than on Version 4, while Question 2 was about threefold faster on Version 4 than on Version 1. We will discuss possible reasons for this observation in the next section.

Table 2
Cypher query running time
Model version	Test query number	SCRIPT database Mean (SD) execution time (ms)	CRITICAL database Mean (SD) execution time (ms)
1	1	268.0 (75.3)	967 (63.89)
1	2	121,414 (538.4)	5,839,586.40 (24,086.19)
4	1	2,763.8 (275.0)	48,639 (16,193.95)
4	2	4,633.2 (1,116.6)	2,118,551.25 (25,507.15)

Abbreviations: CRITICAL, Collaborative Resource for Intensive care Translational science, Informatics, Comprehensive Analytics, and Learning; SCRIPT, Successful Clinical Response in Pneumonia Therapy; SD, standard deviation.

Note: 1. Test query 1 is to find the patients who had spontaneous pneumothorax diagnosis, had a medication prescribed of dexamethasone, and had a “chest”-related procedure.

2. Test query 2 is to find the top 10 most frequently co-prescribed drugs.
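The logic behind test query 2 can be sketched outside the database in a few lines of Python. The sketch treats two drugs as co-prescribed when they appear in the same visit, as described in the Discussion; the function name and the sample exposures are hypothetical, and the study itself ran this as a Cypher query.

```python
from collections import Counter, defaultdict
from itertools import combinations


def top_coprescribed(exposures, k=10):
    """Count drug pairs that occur within the same visit and return the top k."""
    drugs_by_visit = defaultdict(set)
    for visit_id, drug in exposures:
        # A set ignores repeated exposures of the same drug within one visit.
        drugs_by_visit[visit_id].add(drug)

    pair_counts = Counter()
    for drugs in drugs_by_visit.values():
        # Sorting gives each unordered pair one canonical representation.
        pair_counts.update(combinations(sorted(drugs), 2))
    return pair_counts.most_common(k)


# Made-up (visit_id, drug) exposure records.
exposures = [
    (1, "dexamethasone"), (1, "furosemide"), (1, "furosemide"),
    (2, "dexamethasone"), (2, "furosemide"), (2, "heparin"),
    (3, "furosemide"), (3, "dexamethasone"),
]
top_pairs = top_coprescribed(exposures, k=1)
```

The pairing step grows quadratically with the number of drugs per visit, which helps explain why this query is far more expensive than the filter-style query 1 on both schemas.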

Discussion

In this work, we present an efficient mapping from the OMOP CDM to a graph database. In the course of developing the schema, we demonstrated that naïve transformations from a relational to a graph schema can result in impaired performance depending on the queries performed.

We found that both Versions 1 and 4 of our model have strengths and weaknesses. Using the neo4j-admin import method, both versions loaded in less than a minute for the SCRIPT cohort and within an hour for the CRITICAL cohort. Version 1 was roughly 10 seconds faster on the SCRIPT cohort and about 6 times faster on the CRITICAL cohort; however, Version 4 included about 8 million more nodes and 8.9 million more edges than Version 1 on the SCRIPT cohort, and 749 million more nodes and 205 million more edges on the CRITICAL cohort. Not surprisingly, the simpler query (e.g., Condition X and Drug Y) was faster on the simpler and smaller Version 1. Also, since Version 1 is a naïve translation of the relational model, it is consistent that it would perform well on a query type at which relational databases excel. On the other hand, Version 4 performed better in answering the question of finding the top 10 co-prescribed drugs during the same visit, a query that relies on information stored in the graph topology.

Although we lack visibility into the inner workings of the graph database engine, graph databases are typically optimized for storing and querying information that is represented by the topology of the graph, rather than as attributes. In choosing which features of data to model as edges, it is important to think ahead about what questions the team wants to ask of the graph database. The tradeoff is that storing information in edge attributes takes more disk space than storing the same as node attributes. In addition, Version 4 is able to support more complicated queries regarding the patient journey (e.g., find and visualize the sequence of patient diagnoses, medication prescriptions, and procedures in a certain month) that Version 1 cannot. We note that in this work, we purposefully chose questions that both schemas can support to ensure a fair comparison between the two models.

Given the popularity and wide adoption of the OMOP CDM, there have been a variety of studies comparing the performance of various OMOP transforms. These have included transforming OMOP to Fast Healthcare Interoperability Resources (FHIR)[29] and querying OMOP using i2b2[30]; OHDSI itself supports multiple RDBMS, including high-performance and cloud systems such as Apache Spark, Google BigQuery, and AWS Redshift, and additional work has evaluated columnar stores for OMOP.[31] These approaches have similarly demonstrated high-throughput query capabilities, but still rely on RDBMS schemas. Given that certain questions may be optimally solved using a graph database, we believe this work offers a solution that supplements (as opposed to replacing) the RDBMS implementations that most institutions have in place. We believe our approach, which is openly available and can run on a free version of Neo4j, will facilitate broader adoption and evaluation of graph databases for the OMOP CDM. Previous works in the biomedical field that have leveraged graph databases have been primarily focused on knowledge representation with Gene Ontologies (GO) and Human Phenotype Ontology (HPO)[32] or other knowledge graphs[33] for drug discovery or health service claim data.[5] While Park et al suggested a modeling method to transform a relational database to a graph database, it requires a transition from the relational database to a third normal form (3NF) relational database and then to a 3NF Equivalent Graph Transform (3EG); they also focused only on claims data instead of clinical data.[5] While there has been work in the Scalable, Standard based Interoperability Framework for Sustainable Proactive Post Market Safety Studies (SALUS) project to generate OMOP schema data from a Resource Description Framework (RDF) representation,[34] we have not identified prior work in the health and biomedical space that has focused on the conversion of relational OMOP data into graph schemas.

We acknowledge several limitations within our work. First, there have been many comparisons between RDBMS and graph databases.[4] [35] Instead of adding one more comparison between relational databases and graph databases, we focused on designing a graph database from the OMOP relational schema and discussed the strengths and weaknesses of the various graph schemas. In this study, evaluation was performed on cohorts of 590 and 134,145 patients from a single institution, which are significantly smaller than a typical EDW. We intentionally used real-world, moderate-size datasets, the SCRIPT and CRITICAL databases, which can be easily implemented on the free downloadable version of Neo4j. We believe this strategy makes this work immediately applicable to many ongoing projects that utilize the OMOP CDM. Additionally, we focused on a subset of the OMOP standardized clinical tables and vocabulary; however, those tables represent the majority of clinical data categories, and ones that are included in most OMOP instances. Those wishing to extend this model to all OMOP tables can readily apply the same “edge-centric” strategy to the other tables. Furthermore, we selected only two of the four versions we developed for performance evaluation. Based on our experience querying the various versions, we have no reason to believe that the performance of Versions 2 and 3 would not lie on a continuum between Versions 1 and 4. Finally, although the graph database was only implemented in a single graph database platform, we have no reason to believe that an edge-centric mapping strategy would not be applicable to other graph database systems.

Building upon this work, we will continue working with the SCRIPT project researchers to apply clustering, classification, and prediction algorithms to the graph database to contribute novel insights into the biological processes of pneumonia patients, patient trajectories, and associations between drugs and diseases.

Conclusion

We developed a method of transforming OMOP CDM databases to graph databases. Our experimental results show that our final model performed better than the initial naïve relational-to-graph version with respect to code simplicity and query time on complex queries. The use of graph databases in conjunction with RDBMS and other analytic approaches offers more tools to researchers to identify new biological insights.

Clinical Relevance Statement

This work illustrates the implementation method of a graph database for EHRs that excels in answering clinical questions and finding potential patterns behind highly connected datasets using graph algorithms. It sets an example for clinical researchers to transform the OMOP CDM to the graph database and the method can be immediately applied to any OMOP database.

Multiple-Choice Questions

1. When implementing a graph database for EHR, which of the following is most recommended?
a. Think ahead of the clinical questions to ask.
b. Normalize the database to 3NF.
c. Convert the entities to nodes and relationships to edges.
d. Search previously published graph database design papers.
Correct Answer: The correct answer is option a. A graph database performs best when the schema is optimized for the most frequently asked queries or graph algorithms; b is not necessary; c is in general a good practice but not always correct. Graph schema is flexible, so the article recommends starting from a whiteboard and targeting the research questions rather than referring to previously published schemas.
2. Which of the following is the type of question that Version 4 of our graph schema can answer but Version 1 cannot?
a. Basic pattern matching/filtering questions.
b. Community detection questions.
c. Ontology-related questions.
d. Patient diagnosis classification.
Correct Answer: The correct answer is option c. Explanation: All the questions described in a, b, and d can be done in both versions, but Version 1 does not have specific vocabulary nodes and properties, and therefore cannot answer ontology-related questions.

Conflict of Interest

J.A.A.-G. is an employee of Neo4j. J.A.A.-G. joined the project after Neo4j was selected as our graph database engine and did not play a role in that decision. All other authors have no relevant conflicts to disclose.

Acknowledgements

We thank Dong Fu and the Feinberg School of Medicine Information Technology department for technical assistance. We also wish to thank Dr. Leah Welty for statistical guidance. We thank Dr. Yuan Luo and his team for providing the CRITICAL database for performance test. We also thank Dr. Richard Wunderink for providing guidance on clinical questions to test the database performance. We thank Dr. Nicholas Soulakis for his help in provisioning the on-prem Neo4j database server. Finally, we thank Dr. David Stumpf for providing optimization suggestions on Cypher queries.

Protection of Human and Animal Subjects

This study was conducted in accordance with the ethical standards of the institutional review board (IRB). All procedures involving human participants were reviewed and approved by the IRB of Northwestern University (STU00204868 for SCRIPT study and STU00212016 for CRITICAL study).

References
1 Kang M, Alvarado-Guzman JA, Rasmussen L, Starren JB. AMIA Summit. 2021 abstract: graph model for OMOP CDM.

Crossref Search in Google Scholar
Download RIS citation
2 Needham M, Hodler A. Graph algorithms. Accessed February 11, 2022 at: https://learning.oreilly.com/library/view/graph-algorithms/9781492047674/

Download RIS citation
3 Simpson CM, Gnad F. Applying graph database technology for analyzing perturbed co-expression networks in cancer. Database (Oxford) 2020; 2020: baaa110

Crossref PubMed Search in Google Scholar
Download RIS citation
4 Fabregat A, Korninger F, Viteri G. et al. Reactome graph database: efficient access to complex pathway data. PLOS Comput Biol 2018; 14 (01) e1005968

Crossref PubMed Search in Google Scholar
Download RIS citation
5 Park Y, Shankar M, Park BH, Ghosh J. Graph databases for large-scale healthcare systems: a framework for efficient data management and data services. IEEE 30th International Conference on Data Engineering Workshops. Chicago, IL; 2014: 12-19

6 Xia Y, Sun C. Property graph database modeling and application of electronic medical record. Paper presented at: 2018 Eighth International Conference on Instrumentation & Measurement, Computer, Communication and Control (IMCCC) 2018: 963-967

7 Campbell WS, Pedersen J, McClay JC, Rao P, Bastola D, Campbell JR. An alternative database approach for management of SNOMED CT and improved patient data queries. J Biomed Inform 2015; 57: 350-357

8 Neo4j, Inc. Modeling Designs: Developer Guides. Neo4j Graph Database Platform. Accessed February 11, 2022 at: https://neo4j.com/developer/modeling-designs/

9 Rasmussen LV, Brandt PS, Jiang G. et al. Considerations for improving the portability of electronic health record-based phenotype algorithms. AMIA Annu Symp Proc 2020; 2019: 755-764

10 Observational Health Data Sciences and Informatics. OMOP Common Data Model – OHDSI. Accessed February 8, 2022 at: https://www.ohdsi.org/data-standardization/the-common-data-model/

11 National Center for Advancing Translational Sciences (NCATS). National COVID Cohort Collaborative (N3C). National Center for Advancing Translational Sciences. May 12, 2020. Accessed February 11, 2022 at: https://ncats.nih.gov/n3c

12 U.S. Department of Health and Human Services. All of Us Research Program | National Institutes of Health (NIH). National Institutes of Health (NIH)—All of Us. 2020. Accessed February 13, 2022 at: https://allofus.nih.gov/future-health-begins-all-us

13 Alvarado-Guzmán JA, Keren I. Relational to Graph Database: Migration. Published 2017. Accessed September 16, 2024 at: https://www.ohdsi.org/web/wiki/lib/exe/fetch.php?media=resources:jose_alvarado_rd2gd_ohdsi_submission_2017.pdf

14 Pfaff ER, Girvin AT, Gabriel DL. et al; N3C Consortium. Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative. J Am Med Inform Assoc 2022; 29 (04) 609-618

15 Sathappan SMK, Jeon YS, Dang TK. et al. Transformation of electronic health records and questionnaire data to OMOP CDM: a feasibility study using SG_T2DM dataset. Appl Clin Inform 2021; 12 (04) 757-767

16 Maier C, Lang L, Storf H. et al. Towards implementation of OMOP in a German University Hospital Consortium. Appl Clin Inform 2018; 9 (01) 54-61

17 Sun H, Depraetere K, De Roo J. et al. Semantic processing of EHR data for clinical research. J Biomed Inform 2015; 58: 247-259

18 Lynch KE, Deppen SA, DuVall SL. et al. Incrementally Transforming Electronic Medical Records into the Observational Medical Outcomes Partnership Common Data Model: A Multidimensional Quality Assurance Approach. Appl Clin Inform 2019; 10 (05) 794-803

19 OMOP to PCORIv2 ETL Mapping Specification Version 0.1. 2015. Google Search. Accessed March 12, 2024 at: https://www.google.com/search?q=OMOP+to+PCORIv2+ETL+Mapping+Specification+Version+0.1+15+May+2015&rlz=1C5GCCM_en&oq=OMOP+to+PCORIv2+ETL+Mapping+Specification+Version+0.1+15+May+2015&gs_lcrp=EgZjaHJvbWUyBggAEEUYOdIBCDEwMjhqMGo0qAIAsAIA&sourceid=chrome&ie=UTF-8

20 Klann JG, Phillips LC, Herrick C, Joss MAH, Wagholikar KB, Murphy SN. Web services for data warehouses: OMOP and PCORnet on i2b2. J Am Med Inform Assoc 2018; 25 (10) 1331-1338

21 Neo4j, Inc. Neo4j Graph Data Platform. Neo4j Graph Data Platform. Accessed February 13, 2022 at: https://neo4j.com/

22 SCRIPT research team. SCRIPT homepage. Accessed February 13, 2022 at: https://script.northwestern.edu/

23 Grant RA, Morales-Nebreda L, Markov NS. et al; NU SCRIPT Study Investigators. Circuits between infected macrophages and T cells in SARS-CoV-2 pneumonia. Nature 2021; 590 (7847) 635-641

24 Gannon D. Azure's new CosmosDB Planet-Scale Database. Published online 2017.

25 Robinson I, Webber J, Eifrem E. Graph Databases. 2nd ed. O'Reilly Media, Inc. 2015 ISBN: 9781491930892

26 Shannon P, Markiel A, Ozier O. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003; 13 (11) 2498-2504

27 Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. Proc Int AAAI Conf Web Soc Media 2009; 3 (01) 361-362

28 Starren JB, Winter AQ, Lloyd-Jones DM. Enabling a learning health system through a unified enterprise data warehouse: the experience of the Northwestern University Clinical and Translational Sciences (NUCATS) institute. Clin Transl Sci 2015; 8 (04) 269-271

29 OHDSI FHIR Work Groups. Workgroups:mappings_between_ohdsi_cdm_and_fhir. Accessed February 13, 2022 at: https://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:mappings_between_ohdsi_cdm_and_fhir

30 i2b2 tranSMART Foundation. i2b2 Community Wiki. Accessed February 13, 2022 at: https://community.i2b2.org/wiki/display/OMOP

31 Katsma B. Benchmarking big observational health data. Medium. 2020. Accessed February 13, 2022 at: https://medium.com/@b.katsma/benchmarking-big-observational-health-data-97e148c393f4

32 Mughal S, Moghul I, Yu J, Clark T, Gregory DS, Pontikos N. UKIRDC. Pheno4J: a gene to phenotype graph database. Bioinformatics 2017; 33 (20) 3317-3319

33 Queralt-Rosinach N, Stupp GS, Li TS. et al. Structured reviews for data and knowledge-driven research. Database (Oxford) 2020; 2020: baaa015

34 Declerck G, Hussain S, Daniel C. et al. Bridging data models and terminologies to support adverse drug event reporting using EHR data. Methods Inf Med 2015; 54 (01) 24-31

35 Vicknair C, Macias M, Zhao Z, Nan X, Chen Y, Wilkins D. A comparison of a graph database and a relational database: a data provenance perspective. In: Proceedings of the 48th Annual ACM Southeast Conference. ACMSE'10. Association for Computing Machinery; 2010: 1-6

Address for correspondence

Mengjia Kang, MS

Feinberg School of Medicine, Northwestern University

Chicago, Illinois

United States

Email: marjorie.kang@northwestern.edu

Email: mengjiakang17@gmail.com

Publication History

Received: 23 August 2022

Accepted: 27 August 2024

Article published online: 04 December 2024

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
