Yearb Med Inform 2011; 20(01): 112-120
DOI: 10.1055/s-0038-1638748
Working Group Contributions
Georg Thieme Verlag KG Stuttgart

Key Concepts to Assess the Readiness of Data for International Research: Data Quality, Lineage and Provenance, Extraction and Processing Errors, Traceability, and Curation

Contribution of the IMIA Primary Health Care Informatics Working Group
S. de Lusignan (1), S.-T. Liaw (2), P. Krause (3), V. Curcin (4), M. Tristan Vicente (5), G. Michalakidis (6), L. Agreus (7), P. Leysen (8), N. Shaw (9), K. Mendis (10)

Author Affiliations
1   IMIA Primary Healthcare Working Group Co-Chair, Primary Care and Clinical Informatics, University of Surrey, UK
2   General Practice, University of New South Wales, Australia
3   Software Engineering, University of Surrey
4   Imperial College London
5   St. George’s University of London
6   Computing department, University of Surrey
7   Center for Family and Community Medicine, Karolinska Institutet, Stockholm
8   Faculty of Medicine, Dept. of Primary and Interdisciplinary Care, University of Antwerp
9   ESRI Canada Health Informatics Research Chair / Scientific Director, Health Informatics Institute, Algoma University, Ontario, Canada
10   IMIA Primary Healthcare Working Group Chair, University of Sydney, Australia

Acknowledgements: We thank Frank Sullivan and Mark McGilchrist for their comments on the manuscript, and IMIA and EFMI for supporting their primary care informatics working groups. TRANSFoRm is supported by the European Commission DG INFSO (FP7 2477).

Publication History

Publication Date: 06 March 2018 (online)

Summary

Objective

To define the key concepts which inform whether a system for collecting, aggregating and processing routine clinical data for research is fit for purpose.

Methods

Literature review and shared experiential learning from research using routinely collected data. We excluded socio-cultural, privacy, and security issues because our focus was on linking clinical data.

Results

Six key concepts describe data: (1) Data quality: the core overarching concept, which asks whether the data are fit for purpose. (2) Data provenance: how data came to be, incorporating the concepts of lineage and pedigree. Mapping this process requires metadata, and new variables derived during data analysis have their own provenance. (3) Data extraction errors and (4) data processing errors, which are the responsibility of the investigator extracting the data and need to be quantified. (5) Traceability: the capability to identify the origins of any data cell within the final analysis table, which is essential for good governance and almost impossible without a formal system of metadata. (6) Curation: storing data and look-up tables in a way that allows future researchers to carry out further research or review earlier findings.
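
As a purely illustrative sketch (not the working group's system), the relationship between provenance metadata and traceability might be expressed as follows. It assumes each value in the analysis table carries a small provenance record, and that derived variables point back to the provenance of their inputs. The names `Provenance`, `Cell`, `trace`, and the source labels such as `practice_A.weight_kg` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical metadata record describing how one value came to be."""
    source: str          # originating field, or the name of a derived variable
    step: str            # "extracted" or "derived" (illustrative labels)
    parents: tuple = ()  # provenance records of the inputs, if derived


@dataclass
class Cell:
    """One cell of the final analysis table, with its provenance attached."""
    value: object
    provenance: Provenance


def trace(p: Provenance) -> list:
    """Walk back through derivations and return the original source fields."""
    if not p.parents:
        return [p.source]
    origins = []
    for parent in p.parents:
        origins.extend(trace(parent))
    return origins


# Two extracted values and one variable derived during analysis.
weight = Cell(80.0, Provenance("practice_A.weight_kg", "extracted"))
height = Cell(1.80, Provenance("practice_A.height_m", "extracted"))
bmi = Cell(
    round(weight.value / height.value ** 2, 1),
    Provenance("analysis.bmi", "derived",
               (weight.provenance, height.provenance)),
)

print(trace(bmi.provenance))
```

In this sketch the derived variable has its own provenance record, and tracing it recovers the extracted fields it was computed from. Without the attached metadata there would be nothing for `trace` to follow, which is the point made under traceability above.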

Conclusion

Data processing follows common, distinct steps; the quality of any metadata may be predictive of the quality of the process. Outputs based on routine data should include a review of the process from data origin to curation, and should publish information about their data provenance and processing methods.

 