2.1 Understanding the data lifecycle
1. Definition
There are many definitions of research data. The OECD (Organisation for Economic Co-operation and Development) definition is the most commonly used:
"Research data" are defined as factual records (numerical scores, textual records, images and sound) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings.
Source: OECD, OECD Principles and Guidelines for Access to Research Data from Public Funding, Paris, 2007.
As an example, the University of Leeds describes research data as:
"Any information that has been collected, observed, generated or created to validate original research findings. Although usually digital, research data also includes non-digital formats such as laboratory notebooks and diaries."
Source: University of Leeds Library.
In the context of open science, these definitions can be broadened: the data produced by one researcher can also enable other researchers to conduct new research projects.
-
2. Diversity of research data
Depending on the project, the research data may be:
- produced or collected: these are the data created or generated during research activities (observations, measurements, etc.)
- pre-existing: these are data that already exist (corpora, archives, etc.) and are used for the project. They may originally have been collected outside a research context, but they are treated as research data within the project.
There are several ways to classify research data. One of them describes data by:
- the source of the data
- the form of the data
-
What is the source of the data?
Observation data
Observation data are captured in real time by observing a behaviour or activity, and are therefore most often unique and impossible to reproduce. Examples include sensor data, neuroimaging, astronomical photography and survey data.
Experimental data
Experimental data are obtained from laboratory equipment. They are often reproducible, but reproducing them can be costly. Data from chromatographs and DNA chips fall into this category.
Computational or simulation data
Computational or simulation data are generated by computational or simulation models. They tend to come with more extensive metadata. They are often reproducible, provided that the model is properly documented. For simulation data, the model used is often as important as the data it generates, and sometimes even more so. Examples include meteorological models, seismic simulation models and economic models.
Derived or compiled data
Derived or compiled data result from processing or combining raw data. They are often reproducible, but at a cost. Examples include data obtained by text mining, 3D models and compiled databases.
Reference data
Reference data are collections or accumulations of small datasets that have been peer reviewed, annotated and made available.
-
What form does this data take?
Textual data: field or laboratory notes, survey responses...
Numerical data: tables, measurements...
Audiovisual data: images, sounds, videos...
Computer code
Discipline-specific data: for example FITS in astronomy or CIF in crystallography...
Instrument-specific data produced by certain instruments
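As an illustration, the two classification axes above (source and form) can be recorded as a small structured description attached to each dataset. The sketch below is only one possible way to do this: the class names `DataSource`, `DataForm` and `DatasetDescription`, and the example dataset title, are hypothetical and do not come from any standard or library.

```python
from dataclasses import dataclass
from enum import Enum


class DataSource(Enum):
    """Where the data come from (first classification axis)."""
    OBSERVATION = "observation"
    EXPERIMENTAL = "experimental"
    SIMULATION = "computational or simulation"
    DERIVED = "derived or compiled"
    REFERENCE = "reference"


class DataForm(Enum):
    """What form the data take (second classification axis)."""
    TEXTUAL = "textual"                      # notes, survey responses
    NUMERICAL = "numerical"                  # tables, measurements
    AUDIOVISUAL = "audiovisual"              # images, sounds, videos
    CODE = "computer code"
    DISCIPLINE_SPECIFIC = "discipline-specific"   # e.g. FITS, CIF
    INSTRUMENT_SPECIFIC = "instrument-specific"


@dataclass(frozen=True)
class DatasetDescription:
    """Hypothetical minimal record describing one dataset."""
    title: str
    source: DataSource
    form: DataForm

    def label(self) -> str:
        # Human-readable summary combining both axes.
        return f"{self.title}: {self.source.value} data, {self.form.value} form"


# Illustrative dataset (invented for this example).
survey = DatasetDescription(
    title="Field survey responses",
    source=DataSource.OBSERVATION,
    form=DataForm.TEXTUAL,
)
print(survey.label())
```

Recording both axes explicitly makes it easier to decide later how a dataset should be documented and whether it can realistically be reproduced.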
-
3. Why manage and share your data
- Quantity: good data management is necessary to cope with growing data volumes (big data) and, above all, to avoid data loss.
- Quality: sharing data requires good data management practices, which improves the quality of research work.
- Validation of research results: sharing data helps to validate research results. More and more publishers ask researchers to make available all the underlying data mentioned in the submitted article.
- Integrity: making data available provides better protection against scientific fraud.
- Valorisation: data sharing allows researchers to enhance the value of their data and increase its visibility (citation).
- Funding: data sharing (based on the principle of "as open as possible, as closed as necessary") may be a condition for project funding.
- Reproducibility and reuse: the cost of creating, collecting and processing data can be very high. Reusing existing data rather than recreating them reduces the time and cost of research.
- Interdisciplinarity: databases allow better search, extraction, cross-referencing and visualization of data, particularly across different disciplines.
- Exhumation of "fossilized" data: publications provide access to about 10% of the data. The remaining 90% stay on computer hard drives and are not used. They are called "fossilized data". Proper management and sharing of these data would prevent the loss of unique data.
- Heritage value: some research data have scientific heritage value. It is particularly important to manage and share these data properly.
-
4. Research Data Life Cycle
The data life cycle is the set of steps involved in managing, preserving and disseminating the research data associated with research activities. This cycle guides researchers through the research data management process so that they and their stakeholders can make the most of the research data generated.
Source: DoRANum
It can be divided into six different phases: Planning, Collecting, Analysing, Publishing, Preserving, and Reusing.
Source: adapted from Research data lifecycle, UK Data Service
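The six phases can be sketched as an ordered enumeration. This is a minimal illustration, not part of the UK Data Service model itself: the names `Phase` and `next_phase` are hypothetical, and treating Reusing as leading back to Planning is an assumption based on the word "cycle" (reused data feed the planning of a new project).

```python
from enum import Enum


class Phase(Enum):
    """The six phases of the research data life cycle, in order."""
    PLANNING = 1
    COLLECTING = 2
    ANALYSING = 3
    PUBLISHING = 4
    PRESERVING = 5
    REUSING = 6


def next_phase(current: Phase) -> Phase:
    """Return the phase that follows `current`.

    Assumption: the cycle wraps around, so REUSING is followed by PLANNING.
    """
    phases = list(Phase)  # Enum members iterate in definition order
    return phases[(phases.index(current) + 1) % len(phases)]


# Walk once around the cycle, starting from planning.
phase = Phase.PLANNING
for _ in range(len(Phase)):
    print(phase.name.capitalize())
    phase = next_phase(phase)
```

In practice the phases overlap and researchers often move back and forth between them, so the strict ordering here is a simplification.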