The FAIR Principles: a roadmap to achieve a common language among IBD datasets?
Valerie Pittet EpiCom Member; Marieke Pierik EpiCom Chair
The FAIR Data Principles propose that scientific outputs should be Findable, Accessible, Interoperable and Reusable [1]. Their aim is to encourage good practice in the publication of scientific research data. To this end, they provide a list of recommendations on documenting data and datasets, designed to help both researchers and computers.
Overall, the Principles propose that a unique identifier should be assigned to each dataset and that all data should be described as precisely and in as much detail as possible using metadata. Metadata are “data about data”: they summarise basic information about data (e.g. project title, author names, date created, date modified, standards used, file size), which makes it easier to find and work with particular data. Data and metadata can be stored and made findable and accessible through local institutional repositories with appropriate authorisation or, when possible, through openly available global repositories, depending on the type of research data.
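As an illustration of the kind of metadata described above, the following minimal Python sketch builds a machine-readable record for a dataset. The class name `DatasetMetadata`, the field names and all values (including the identifier) are hypothetical placeholders chosen for this example; they are not part of the FAIR Principles or of any ECCO specification.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class DatasetMetadata:
    """Basic 'data about data' for one dataset (illustrative fields only)."""
    identifier: str                 # unique identifier assigned to the dataset
    title: str                      # project title
    authors: List[str]              # author names
    date_created: str               # ISO 8601 date
    date_modified: str              # ISO 8601 date
    standards_used: List[str] = field(default_factory=list)
    file_size_bytes: int = 0


# All values below are made-up placeholders, not a real dataset.
record = DatasetMetadata(
    identifier="example-id-0001",
    title="Example IBD cohort",
    authors=["A. Researcher", "B. Clinician"],
    date_created="2018-01-15",
    date_modified="2018-06-30",
    standards_used=["example controlled vocabulary"],
    file_size_bytes=1048576,
)

# Serialise to JSON so that both people and software can read the record.
metadata_json = json.dumps(asdict(record), indent=2)
print(metadata_json)
```

A record like this could be stored next to the data in a repository, so that the dataset can be found and assessed without opening the data themselves.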
To achieve this overall aim, the main question is where to start. The Epidemiological Committee (EpiCom) of ECCO has found that one important aspect of these Principles is data interoperability, defined as the use of a formal, accessible, shared and broadly applicable language for knowledge representation [2]. For clinical research, this is mainly a question of how to make the best use of the huge amount of unstandardised data, including structured data extracted from patient charts and patient-reported data. Indeed, existing databases are usually very heterogeneous in structure and content [3,4], reflecting different healthcare systems or different practices. They may rely on public or proprietary sources [5,6] and may capture information from diverse catchment populations, settings and time periods [7,8]. Existing databases usually have different underlying data models [9,10], which may affect syntactic and semantic interoperability as well as data quality [6]. For example, smoking status may be defined and collected differently across datasets, from a yes/no/currently/formerly option to the number of cigarettes per day or week. Therefore, the exchange and sharing of datasets, which calls for harmonisation of their characteristics for common analysis purposes, is challenging [11].
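The smoking status example can be made concrete with a short Python sketch that maps differently coded source values onto one shared vocabulary. The target categories (`current`, `not_current`, `unknown`) and the function name `harmonise_smoking` are assumptions made for this illustration, not an agreed standard. Note the loss of information: a dataset that records only cigarettes per day cannot distinguish former smokers from never smokers, so the sketch falls back to the coarsest category that all sources support.

```python
from typing import Union

# Hypothetical mapping from categorical source codes to a shared vocabulary.
CATEGORICAL_MAP = {
    "yes": "current",
    "currently": "current",
    "no": "not_current",
    "formerly": "not_current",
}


def harmonise_smoking(value: Union[str, int, float, None]) -> str:
    """Map a source-specific smoking value to 'current', 'not_current' or 'unknown'."""
    if value is None or isinstance(value, bool):
        # Missing values and ambiguous booleans are not interpreted.
        return "unknown"
    if isinstance(value, str):
        # Categorical datasets: normalise spelling, then look up the code.
        return CATEGORICAL_MAP.get(value.strip().lower(), "unknown")
    # Numeric datasets: the value is a number of cigarettes per day or week.
    if value < 0:
        return "unknown"
    return "current" if value > 0 else "not_current"


# Values as they might appear in three differently structured datasets.
for raw in ["Yes", "formerly", 10, 0, None, "occasionally"]:
    print(repr(raw), "->", harmonise_smoking(raw))
```

In practice the mapping table would be agreed with the local teams who know how each source variable was collected.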
To overcome these challenges, teams from international health data networks have proposed developing common data models, e.g. the Clinical Data Interchange Standards Consortium (CDISC) Analysis Data Model (ADaM) or the Observational Medical Outcomes Partnership (OMOP, www.omop.org). The goal of common data models is to harmonise data elements, formats and terminologies as far as possible, while preserving information and quality [3,5]. This is also a first step towards defining core datasets that describe the main characteristics to be documented for a population of interest. Common data models are independent of specific study requirements but are specific to each data network [12]. They take account of local expertise in databases and coding systems because of the potential impact on analysis and interpretation of results; this means that local teams are consulted on details linked to their own datasets. Data are then processed locally using a single shared, study-specific script, and the aggregated, de-identified results can be encrypted and shared for further statistical analysis [3,5,12].
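The distributed workflow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of any particular network: the function names `local_summary` and `pool`, the field `smoking_status` and the example records are all hypothetical, and the encryption and transfer steps are left out.

```python
from collections import Counter
from typing import Dict, Iterable, List


def local_summary(records: Iterable[dict]) -> Dict[str, int]:
    """Shared script run inside each site: returns counts only, no patient-level rows."""
    return dict(Counter(r["smoking_status"] for r in records))


def pool(summaries: List[Dict[str, int]]) -> Dict[str, int]:
    """Run by the coordinating centre on the aggregated results it receives."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)  # adds the counts category by category
    return dict(total)


# Made-up patient-level data; these rows never leave their own site.
site_a = [
    {"patient_id": "A1", "smoking_status": "current"},
    {"patient_id": "A2", "smoking_status": "not_current"},
]
site_b = [
    {"patient_id": "B1", "smoking_status": "current"},
    {"patient_id": "B2", "smoking_status": "current"},
    {"patient_id": "B3", "smoking_status": "unknown"},
]

# Only the aggregated, de-identified summaries are shared and pooled.
pooled = pool([local_summary(site_a), local_summary(site_b)])
print(pooled)
```

The design choice here is that patient identifiers appear only in the local data: the summaries contain category counts and nothing else, which is what allows them to be shared.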
Several international health data networks using common core datasets have already shown great potential for the generation of timely healthcare evidence [5,13,14], and researchers continue to learn more about dataset limitations and data quality issues [11,15]. EpiCom will coordinate the work to define common datasets, using a controlled vocabulary for data and metadata description, for use in the IBD data network. This work is timely for two reasons. First, we will increasingly have access to huge amounts of newly generated data, as digitalisation makes it possible to collect new types of data in clinical practice, such as patient-reported outcomes and experiences (PROs and PREs). Second, structured data from ambulatory or hospital information systems (including clinical, biology, pathology, diagnostics and genetics) will also become increasingly accessible. Deriving a common language to be used among datasets may therefore also assist future data linkage with other databases, such as administrative or environmental databases (e.g. relating to air pollution, water supply or food products). As a first step, EpiCom will perform a topical review of existing cohorts and registers to assess commonalities and differences among datasets.
ECCO is a fantastic and already established network of experts that can help to define a standard set of data to be collected in inflammatory bowel diseases (IBD). This will enable sharing of IBD data and metadata and, by making appropriate sample sizes attainable, will help to achieve multiple goals. Such goals may include, for example, optimal identification of IBD patients for inclusion in clinical trials and the implementation of innovative research projects and analyses on shared datasets. In the future, initiatives such as EpiCom’s will also help to ensure that all projects are “FAIR-ly compatible”, as requested by an increasing number of funders.
References
1. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
2. www.force11.org.
3. Gini R, Schuemie M, Brown J, et al. Data extraction and management in networks of observational health care databases for scientific research: A comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE strategies. EGEMS (Wash DC). 2016;4:1189.
4. Voss EA, Makadia R, Matcho A, et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc. 2015;22:553–64.
5. Brown JS, Holmes JH, Shah K, et al. Distributed health data networks: a practical and preferred approach to multi-institutional evaluations of comparative effectiveness, safety, and quality of care. Med Care. 2010;48:S45–51.
6. Brown JS, Kahn M, Toh S. Data quality assessment for comparative effectiveness research in distributed data networks. Med Care. 2013;51:S22–9.
7. Menditto E, Bolufer De Gea A, Cahir C, et al. Scaling up health knowledge at European level requires sharing integrated data: an approach for collection of database specification. Clinicoecon Outcomes Res. 2016;8:253–65.
8. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58:323–37.
9. Kahn MG, Batson D, Schilling LM. Data model considerations for clinical effectiveness researchers. Med Care. 2012;50 Suppl:S60–7.
10. Ogunyemi OI, Meeker D, Kim HE, et al. Identifying appropriate reference data models for comparative effectiveness research (CER) studies based on data from clinical information systems. Med Care. 2013;51:S45–52.
11. Overhage JM, Overhage LM. Sensible use of observational clinical data. Stat Methods Med Res. 2013;22:7–13.
12. Trifiro G, Coloma PM, Rijnbeek PR, et al. Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how? J Intern Med. 2014;275:551–61.
13. Oderkirk J, Ronchi E, Klazinga N. International comparisons of health system performance among OECD countries: opportunities and data privacy protection challenges. Health Policy. 2013;112:9–18.
14. Hripcsak G, Ryan PB, Duke JD, et al. Characterizing treatment pathways at scale using the OHDSI network. Proc Natl Acad Sci U S A. 2016;113:7329–36.
15. Kahn MG, Callahan TJ, Barnard J, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC). 2016;4:1244.