Valerie Pittet, Marieke Pierik © ECCO
The FAIR Data Principles propose that scientific outputs should be Findable, Accessible, Interoperable and Reusable [1]. They aim to encourage good practice in the publication of scientific research data. To this end, they provide a list of recommendations on the documentation of data and datasets, designed to be useful both for researchers and for computers.
Overall, the Principles propose that each dataset should be assigned a unique identifier and that all data should be described as relevantly and in as much detail as possible using metadata. Metadata are “data about data”: they summarise basic information about data (e.g. project title, author names, date created, date modified, standards used, file size) and so make it easier to find and work with particular data. Depending on the type of research data, data and metadata can be stored and made findable and accessible through local institutional repositories with appropriate authorisation or, where possible, through openly available global repositories.
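As an illustration of the kind of metadata record described above, the sketch below shows a minimal, hypothetical description of an IBD dataset. The field names and values are assumptions for illustration only; the FAIR Principles do not prescribe a specific schema, and any real implementation would follow a community metadata standard.

```python
# Hypothetical, minimal metadata record for an IBD dataset.
# Field names are illustrative, not prescribed by the FAIR Principles.
dataset_metadata = {
    "identifier": "doi:10.0000/example",        # hypothetical persistent identifier
    "title": "Example IBD cohort dataset",
    "creators": ["V. Pittet", "M. Pierik"],
    "date_created": "2020-01-15",
    "date_modified": "2020-06-01",
    "standards_used": ["SNOMED CT", "LOINC"],   # illustrative terminology standards
    "file_size_bytes": 10_485_760,
}

def is_minimally_findable(metadata: dict) -> bool:
    """A dataset is minimally findable when it carries a persistent
    identifier and a descriptive title (a deliberately reduced check)."""
    return bool(metadata.get("identifier")) and bool(metadata.get("title"))

print(is_minimally_findable(dataset_metadata))  # True
```

Even such a small record already supports findability: a catalogue or repository can index the identifier, title and creators without opening the data themselves.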
To achieve this overall aim, the main question is where to start. The Epidemiological Committee (EpiCom) of ECCO has identified data interoperability, defined as the use of a formal, accessible, shared and broadly applicable language for knowledge representation [2], as one important aspect of these Principles. The issue is chiefly how to make optimal use of the huge amount of unstandardised data (including structured data extracted from patient charts or patient-reported data) for clinical research. Existing databases are usually very heterogeneous in structure and content [3,4], reflecting different healthcare systems or practices. They may rely on public or proprietary sources [5,6] and may capture information from diverse catchment populations, settings and time periods [7,8]. They also usually have different underlying data models [9,10], which may affect syntactic and semantic interoperability as well as data quality [6]. For example, smoking status may be defined and collected in different ways across datasets, from a yes/no/currently/formerly option to the number of cigarettes per day or week. The exchange and sharing of datasets, which calls for harmonisation of their characteristics for common analysis purposes, is therefore challenging [11].
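The smoking-status example above can be sketched in code. The snippet below is a hypothetical illustration, assuming two source registries that encode smoking differently (one categorical, one as cigarettes per day); both are mapped into a single shared controlled vocabulary. The registry names, codes and vocabulary terms are all assumptions, not part of any established common data model.

```python
# Shared controlled vocabulary (illustrative, not an official standard).
SMOKING_VOCAB = {"never", "former", "current", "unknown"}

def harmonise_smoking(raw, source: str) -> str:
    """Map a source-specific smoking value to the shared vocabulary.
    'registry_a' and 'registry_b' are hypothetical source datasets."""
    if source == "registry_a":
        # Categorical yes/no/formerly style.
        mapping = {"yes": "current", "no": "never", "formerly": "former"}
        return mapping.get(str(raw).lower(), "unknown")
    if source == "registry_b":
        # Numeric: cigarettes per day.
        if raw is None:
            return "unknown"
        return "current" if int(raw) > 0 else "never"
    return "unknown"

print(harmonise_smoking("formerly", "registry_a"))  # former
print(harmonise_smoking(10, "registry_b"))          # current
```

Note that the harmonised value is deliberately coarser than some sources: mapping a cigarette count to "current" discards quantity, which is exactly the kind of information loss that common data model teams weigh against comparability [3,5].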
To overcome these challenges, international health data network teams have proposed developing common data models, e.g. the Clinical Data Interchange Standards Consortium (CDISC) Analysis Data Model (ADaM) or the Observational Medical Outcomes Partnership (OMOP, www.omop.org). The goal of common data models is to harmonise data elements, formats and terminologies as far as possible while preserving information and quality [3,5]. This is also a first step towards defining core datasets that describe the main characteristics to be documented for a population of interest. Common data models are independent of specific study requirements but are specific to each data network [12]. They take account of local expertise in databases and coding systems, because of the potential impact on analysis and interpretation of results; this means that local teams are consulted on details linked to their own datasets. Data are then processed locally using a single shared, study-specific script, and the aggregated and de-identified results can be encrypted and shared for further statistical analyses [3,5,12].
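The federated pattern described above, in which a single shared script runs locally at each site and only aggregated, de-identified results leave that site, can be sketched as follows. The site data, field names and aggregation are hypothetical; a real shared script would be study specific and far richer.

```python
from collections import Counter

def shared_script(local_records: list[dict]) -> Counter:
    """Hypothetical shared, study-specific script: run locally at each
    site, it returns only aggregated diagnosis counts. Patient-level
    rows never leave the site."""
    return Counter(record["diagnosis"] for record in local_records)

# Illustrative patient-level data held locally at two sites.
site_a = [{"diagnosis": "CD"}, {"diagnosis": "UC"}, {"diagnosis": "CD"}]
site_b = [{"diagnosis": "UC"}, {"diagnosis": "CD"}]

# The coordinating centre receives and pools only the aggregates.
pooled = shared_script(site_a) + shared_script(site_b)
print(dict(pooled))  # {'CD': 3, 'UC': 2}
```

The design choice is the key point: because every site runs the identical script against its locally mapped data, the aggregates are comparable by construction, and only counts (which can additionally be encrypted in transit) are shared for joint analysis [3,5,12].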
Several international health data networks using common core datasets have already shown great potential for the generation of timely healthcare evidence [5,13,14], and researchers continue to build knowledge on dataset limitations and data quality issues [11,15]. EpiCom will coordinate the work to define common datasets, using a controlled vocabulary for data and metadata description, for use in the IBD data network. This is particularly important now. On the one hand, we will increasingly have access to huge amounts of newly generated data, as digitalisation makes it possible to collect new types of data in clinical practice (PROs and PREs). On the other hand, structured data from ambulatory or hospital information systems (including clinical, biology, pathology, diagnostics and genetics data) will also become increasingly accessible. Deriving a common language for use across datasets may therefore also assist future linkage with other databases, such as administrative or environmental databases (e.g. relating to air pollution, water supply or food products). As a first step in this process, EpiCom will perform a topical review of existing cohorts and registers to assess commonalities and differences among datasets.
ECCO is a fantastic, already established network of experts that can help to define a standard set of data to be collected in inflammatory bowel diseases. This will enable the sharing of IBD data and metadata and, through the attainment of appropriate sample sizes, support multiple final goals. Such goals may include, for example, optimal identification of IBD patients for inclusion in clinical trials and the implementation of innovative research projects and analyses on shared datasets. In the future, initiatives such as EpiCom’s will also help to ensure that all projects are “FAIR-ly compatible”, as requested by an increasing number of funders.