Section outline

• The scientific world has embraced digital technology in its research, publication and communication practices. It is now technically possible to open up science to the greatest number of people, by providing open access to publications and - as far as possible - to research data.

    This course introduces you to the challenges of Research Data Management and sharing (RDM) in the context of Open Science (OS).

    It was created within the framework of the Erasmus+ Oberred project in 2019. Other courses from the Oberred project are available on this platform.


    OBERRED project

    This course was carried out in the context of the Oberred project, co-funded by the Erasmus+ Programme of the European Union.

    Oberred is an acronym for Open Badge Ecosystem for the Recognition of Skills in Research Data Management and Sharing. The aim of the Oberred project is to create a practical guide that includes the technical specifics and issues of Open Badges, roles and skills related to RDM, and principles for the application of Open Badges to RDM.

    Find out more about the Oberred project here: http://oberred.eu/


    This course is open access!

No account creation or registration is required; however, you will only be able to browse the course in read-only mode.
    To participate in the activities (exercises, forum...) and get the badge(s), you must register for the course.
    Register for the course

    For optimal use of this course, we recommend using the Google Chrome browser.

    • Course structure

      This first lesson is an introduction to Research Data Management. It will enable you to grasp the context in which research data management takes place and give you an overall vision of the stakes involved in opening up and sharing such data.

      • Data and society
      • Data and science
• Science and society: COVID-19 example
      • Open science and RDM
• Evaluation 1

      The second lesson will enable you to better understand the different steps of research data management, and to know the practices to be implemented and the tools to be used.

      • Understanding the data life cycle
      • FAIR principles
      • Data Management Plan (DMP)
      • Legal and ethical aspects
      • Metadata
• Persistent identifiers
• The 3 distinct steps of data storage
      • Reuse and valorisation of data
      • Evaluation 2 


      Learning objectives

      This course should provide you with a good understanding of the context in which research data management and sharing takes place:

      • What are the issues and benefits of controlled data management? 
      • What concepts are related? 
      • How is data management organized and which actors are involved?
• Author(s) / Trainer(s): Viêt Jeannaud, Nicolas Hochet, Yvette Lafosse, Pierrette Paillassard, Claire Sowinski, Coralie Wysoczynski, Marta Blaszczynska, Mateusz Franczak, Michel Roland, Tomasz Umerle, Beata Koper, Barbara Wachek, Lucas Ricroch
  Target audience: everyone
  Estimated duration: 1 week
  Prerequisites: none
  License: CC BY-NC-SA
  Open badge: Yes
  Number of enrolled participants: 3


• The world contains an unimaginably vast amount of digital information, which is getting ever vaster ever more rapidly. This in principle makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. Advances in digital technology have changed how we live, work and socialise: how we communicate (Facebook, Twitter, Instagram, WhatsApp, Skype, etc.), work (videoconferencing, email, Google Drive, Microsoft Teams, etc.), eat (Uber Eats, etc.), travel (Uber, Couchsurfing, Booking.com, Airbnb, etc.) and entertain ourselves at home (Netflix, streaming books/podcasts, etc.).


    • The impact of these technologies on our lives is therefore vast and seems rather well accepted according to the 2017 Eurobarometer. “75% of respondents think the most recent digital technologies have a positive impact on the economy, 67% - on their quality of life, 64% - on the society. 76% who use Internet every day say the impact of these technologies on their quality of life has been positive, compared to 38% who never use the Internet.”

      Source: European Commission, Attitudes towards the impact of digitisation and automation on daily life
• The European Commission is taking this digital context into account in its plans, such as the European data strategy.

      The European data strategy aims to make the EU a leader in a data-driven society. Creating a single market for data will allow it to flow freely within the EU and across sectors for the benefit of businesses, researchers and public administrations.

      So data is everywhere, so much so that we talk about Big Data. But what do we really mean by Big Data?


    • Big data

It is very difficult to agree on a common definition of Big Data. The term refers to a loosely defined body of digital data stored for commercial, administrative and scientific purposes.
      Big Data consists of 3 main characteristics, called the 3Vs: Volume - Velocity - Variety
      Big Data means different things to different people. Regardless of the sources of the digital data, such as books, social media, databases, audio, and video, big data exhibits the characteristics of high-volume, high-velocity (speed of data in and out), and high variety (range of data types and sources).


    • This new type of data enriches research prospects and has potential to advance research in the humanities and social sciences in the following ways:

      • Advanced big data collection tools, such as web scraping, and innovative analytic techniques, such as machine learning, may help establish new research methodologies; 
      • New types of data may reveal new patterns and insights into human society, politics, and economics; 
      • New types of data may lead to new kinds of research questions that are beyond the perspectives of established theories.

    • A few terms related to big data

Set of concepts and technologies (see below) that use intelligent behaviour based on algorithms (sets of rules to be followed to solve a specific problem).

Automated analytical systems that learn over time, as they acquire more data.

Algorithms that use neural networks to learn from unstructured data (images, audio, videos, posts on social networks...).

Self-learning systems that use sets of complex algorithms to mimic processes occurring in the human brain.


• Open data: opening up administrative and political data


      Open data are freely accessible micro-data that can be used and reused freely by everyone. The term open data first appeared in 1995 in a document from an American scientific agency; it referred to the dissemination of geophysical and environmental data, but the idea that the empirical basis on which knowledge is built is a public good that should be available to all is much older.


Don’t lock it away, do something useful with it. Image: notbrucelee, CC BY-SA

    • “the availability of open data creates opportunities for all kinds of organisations, government agencies and not-for-profits to come up with new ways of addressing society’s problems. These include predictive healthcare, and planning and improving London’s public transport system”

      Source : The Conversation, The future will be built on open data – here’s why
• Example: A big boost to open data came in the late 2000s, when first the OECD invited member country governments to open up their data in 2008, and then the United States government launched the data.gov website in 2009, designed to provide full access to databases and time series held by the states of the Union and federal agencies.

The Open Government Partnership was launched in 2011 as an initiative for openness, transparency and civic participation, involving 65 governments committed to implementing an action plan on five thematic areas - participation, transparency, integrity, accountability and technological innovation.

The main difference from the past is that, while previously some public bodies made macro-data - that is, aggregated data - available through publications, online documents, DVDs, etc., open data are micro-data that can be downloaded from the internet free of charge, already in matrix format (generally .csv or .xml) and immediately usable for secondary analysis.



      These are generally data that have great relevance for the planning, monitoring and evaluation of public policies, and which are made open to all with a dual cognitive and regulatory objective. They provide technicians and experts with knowledge bases to redirect and improve policies and also allow citizens to find out whether the policies implemented have had the announced effects or not.

      Open data are also a consequence of the importance that transparency and accountability (the obligation for a subject to account for their decisions and to be responsible for the results achieved) are gaining nowadays.

Although open data have created new opportunities for secondary research, it should be emphasized that there are limits to their use. Firstly, there is a problem of topics covered: although in principle open data can deal with any subject, to date fully public data are almost exclusively economic, geographic and transport-related. A second limitation concerns the way in which data are opened: the matrices should be complemented with additional information on the methodological choices made to produce the data and with indications on the various aspects of their quality. Often this information is missing, which makes it difficult to analyse the data effectively.
• Why should data be open?

      1. Transparency: In a well-functioning democratic society citizens need to know what their government is doing. To do that, they must be able to freely access government data and information and share that information with other citizens. Transparency isn’t just about access, it is also about sharing and reuse — often, to understand material it needs to be analyzed and visualized and this requires the material to be open so that it can be freely used and reused.
      2. Releasing social and commercial value: In the digital age, data is a key resource for social and commercial activities. Everything from finding your local post office to building a search engine requires access to data, much of which is created or held by government. By opening up data, government can help drive the creation of innovative business and services that deliver social and commercial value.

3. Participation and engagement: Participatory governance or, for business and organizations, engaging with your users and audience. Much of the time citizens are only able to engage with their own governance sporadically — maybe just at an election every 4 or 5 years. By opening up data, citizens are enabled to be much more directly informed and involved in decision-making. This is more than transparency: it’s about making a full “read/write” society — not just about knowing what is happening in the process of governance, but being able to contribute to it.
    • Evolution of science

      The way science looks today differs greatly from the scientific practices of the past. The colossal amount of data and the tools for handling them have a dramatic effect on the way science is done.

      Big Data is changing science in two ways:
      • Science can gather increasing amounts of data from the society that may be used for analysis. 
• Scientific activities themselves also produce larger amounts of data than ever before.

Big data and science

Link between big data and science: data collection, analysis, and production of scientific information.

We live in a data-driven world. At any time we have access to a huge amount of digital information, which is growing daily. The increase in the amount of available data has opened the door to a new area of research based on big data - huge data sets that contribute to the creation of better operational tools in all sectors as well as to the development of scientific research.
    • Data driven science: a new paradigm?

      Science is the pursuit and application of knowledge and understanding of the natural and social world following a systematic methodology based on evidence: observation, experiment, induction, repetition, critical analysis, verification and testing.

Since the beginnings of science, different scientific methodologies have emerged. Some have profoundly changed the way research is conducted, leading to paradigm shifts. The impact of data on science is also causing profound changes: we speak of data-driven science, an empirical research method which aims at making inferences from huge amounts of data.


Data-driven science is sometimes described as a fourth scientific paradigm, following the experimental, theoretical and computational ones, but the debate on the advent of this fourth paradigm remains open. For some, it is not so much a new paradigm as a method which is complementary to traditional approaches and is needed because of the presence of large volumes of data.

      In any case, science is increasingly focused on data which, because of their openness and exponential growth, must now be taken into account in the scientific research process.

      Let's focus now on the consequences of the consideration of data according to disciplines.
    • Consequences according to disciplines

The term 'data' intuitively seems more prevalent in the natural and social sciences (e.g. survey data, experimental data). Today's humanities researchers seem more inclined to consider their sources and results as research data, due to the widespread use of digital means in academic workflows.


      Disciplinary specificities: the digital humanities

      Digital Humanities is an emerging field of science where scholars from across the humanities (historians, linguists, artists, media scholars, etc.) work in tandem with librarians, computer and data scientists.


• At the beginning, the digital humanities mainly curated and analyzed data that were born analogue (texts, objects and images) but subsequently archived in digital forms that could be searched to guide automated analysis and visualization. Today, the digital humanities use sophisticated tools for curating and sharing data, increasing the scale of research across a far wider range and volume of sources. Rather than concentrating on a small basket of sources to analyze, it becomes possible to work with thousands of cultural products (paintings, books, photos, articles, etc.). Counting, classifying, graphing and mapping these data may offer new insights and raise interest in the humanities as a field of science.

      Some common practices in Digital Humanities are Text and Data Mining and Data visualization.
    • Text and Data Mining

      Text mining, or Text and Data Mining (TDM), is a field which, with the use of appropriate tools, deals with text analysis, exploration, preparation of summaries, clustering and categorisation of documents, finding groups of words with similar meaning or automatic recognition of complex expressions.

By using text-mining methods it is possible to obtain data from the text that are suitable for quantitative statistical analysis. Text mining represents a completely different approach to textual data: texts are no longer treated as purely qualitative data, but as a specific source of quantitative data - above all, on the frequency of occurrence of individual words in the analysed text. Text mining allows relatively automated searches of very large portions of text for keywords, their density and so on. This makes it possible to apply new methods of data analysis and to obtain new types of information concerning, among other things, the nature of the analysed texts or the variation in the frequency of keywords over time.
Source: Gabriel Gallezot, Emmanuel Marty, "Le temps des SIC", in Bernard Miège, Nicolas Pélissier, Domenget (eds.), Temps et temporalités en information-communication : des concepts aux méthodes, L'Harmattan, 2017, pp. 27-44. DOI: 10.5281/zenodo.1000778 (sic_01599944)
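
      To make this concrete, here is a minimal sketch of the kind of keyword-frequency counting described above, written in Python with only the standard library (the tiny corpus and all values are invented for illustration):

          import re
          from collections import Counter

          # A tiny invented corpus; in practice this would be thousands of documents.
          documents = [
              "Open data accelerates research and enables data reuse.",
              "Research data management makes data findable and reusable.",
          ]

          def word_frequencies(text):
              """Lowercase the text, split it into word tokens and count occurrences."""
              tokens = re.findall(r"[a-z]+", text.lower())
              return Counter(tokens)

          # Aggregate frequencies over the whole corpus.
          corpus_counts = Counter()
          for doc in documents:
              corpus_counts.update(word_frequencies(doc))

          # The most frequent words hint at the dominant themes of the corpus.
          print(corpus_counts.most_common(5))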


    • Data visualization

      This modernised technology (and at the same time methodology) is increasingly present in every sphere of human activity: from research and development to business, social activities and art. It offers practical knowledge of how to graphically "master" huge sets of data that describe a given aspect of reality.




Example of a data visualization from research on the ICOS Carbon Portal
      The purpose of data visualization is to show information in a way that allows its accurate and effective understanding and analysis. This is because people easily recognize and remember the images presented to them (shape, length, construction etc.). Thanks to visualization we can combine large data sets and show all the information at the same time, which greatly facilitates analysis. We can also use visual comparisons, thanks to which it is much easier to find many facts. Another advantage is the ability to analyse data at several levels of detail.

      Here is an example of data visualization from the "Republic of Letters". Researchers map thousands of letters exchanged in the 18th century and can learn very rapidly what it once took a lifetime of study to comprehend.



      We deal with visualization at every step of our lives. Graphic representation is used on television, in the press and in any other source of information (excluding radio stations) whenever there is numerical data. Visualization is necessary when we want to show a certain currency rate at a certain time (linear chart), election results (histograms) or the weather forecast. However, these are not the only examples of graphic representation of data. While it can serve to make it easier to see certain properties, it also makes it easier to discover them. This above all applies to large data sets compiled over many years which can be used for subsequent research.
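
      As a small illustration of the chart types mentioned above, here is a sketch that draws a line chart and a histogram in Python, assuming the matplotlib library is installed (the data are invented):

          import random
          import matplotlib.pyplot as plt

          # Invented example data: a daily exchange rate and a set of survey scores.
          days = list(range(30))
          exchange_rate = [1.10 + 0.005 * d + random.uniform(-0.01, 0.01) for d in days]
          survey_scores = [random.gauss(50, 10) for _ in range(500)]

          fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

          # A line chart suits values evolving over time (e.g. a currency rate).
          ax1.plot(days, exchange_rate)
          ax1.set_title("Exchange rate over time")
          ax1.set_xlabel("Day")
          ax1.set_ylabel("Rate")

          # A histogram shows how values are distributed (e.g. election or survey results).
          ax2.hist(survey_scores, bins=20)
          ax2.set_title("Distribution of survey scores")
          ax2.set_xlabel("Score")
          ax2.set_ylabel("Count")

          plt.tight_layout()
          plt.show()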
• Social media, science and politics: how COVID-19 made science processes mainstream

      The recent COVID-19 pandemic has highlighted the long-standing need for more openness in science. This includes collaboration, knowledge-sharing, and exchange of ideas. A major aspect of this phenomenon has been the need for open research data in order to “accelerate the pace of research critical to combating the disease” (source: Why open science is critical to combatting COVID-19).

The emergence of a new virus that we knew very little about put enormous pressure on governments to make quick decisions based on scarce data.

Moreover, public opinion was swayed by fake news, discussions by non-experts - including celebrities - on social networks, as well as a number of harmful conspiracy theories.



      As this global crisis shows, the world of science cannot be perceived as an ivory tower. Instead, we need to see scientists as important actors who interact both with societies and with politicians in an active fashion.

      They need to be actively involved in the public debate, present their research results in an accessible way and offer recommendations for public policy.
    • The Lancet article affair: a question of data

The medical journal The Lancet is one of the oldest and most reputable journals in its field. It recently fell from its pedestal following the withdrawal of a study on hydroxychloroquine based on unreliable data. A look back at this case highlights the importance of access to source data.


      In relation to this case, researchers from around the world wrote an open letter to report numerous irregularities related to the number of COVID-19 cases in different countries.

      Quoting from the open letter:
      The authors have not adhered to standard practices in the machine learning and statistics community. They have not released their code or data. There is no data/code sharing and availability statement in the paper. The Lancet was among the many signatories on the Wellcome statement on data sharing for COVID-19 studies.
      Thus, the whole issue revolves around the question of data, or rather the lack of access to data and the impossibility of verifying it, which is synonymous with the inability to conduct reliable scientific research.
    • Consequences

This appears undoubtedly as a major scandal in the field of medical research. It is all the more acute because it concerns an emergency situation - a pandemic caused by a previously unknown virus - which, of course, causes fear, but is also connected with the rise of conspiracy theories and the appearance of fake news. Currently, we are facing not only the health and economic consequences of a pandemic caused by a new coronavirus, but also - as many experts emphasize - an equally dangerous phenomenon accompanying it: the so-called infodemic, a flood of false or misleading information about the virus, the disease, etc.

      This phenomenon is intensified especially in a situation where scientific publications and experts have been discredited and part of the society has stopped trusting them.

      Also, hasty political decisions were made as a result of blind faith in scientific publications. The consequences turned out to be catastrophic, and eventually the WHO decided to "restart" research programs on hydroxychloroquine.

      The issue concerns the basic principles that should govern science: transparency and rigorous evaluation of results before publication. In this case, both elements were missing.


    • At the same time, however, the positive aspects of this situation should be noted. As a consequence, the general public was acquainted with the issues and problems of scientific publications. Moreover, for the first time the issue of open access to data, as well as sharing and managing it, has reached such a large group of people not involved professionally in scientific research. As a result of this case, a large part of the society began to take an interest in the issue of accessibility to data and could understand its importance not only in modern science but also in everyday social life.
    • European policy

      Without any doubt, the COVID-19 crisis has challenged the way we view data and its relationship with contemporary society. This needs to be mirrored in European policy. Some actions were taken immediately.

      For instance, a dedicated section of the European Data Portal (EDP) was created in order to offer verified information and data on COVID-19. It was updated between April and July 2020 and aimed “to ensure that everyone – even people without extensive data skills – can understand and gain insights from the available data”.

      This also allowed the citizens themselves to become powerful actors who could collect, organise and analyse data and create useful datasets.

      62 datasets and 60 data initiatives were collected as a result.

      The EDP editorial team concluded by stating a need for “a cultural change, in which all of us become more data literate and embrace it as a valuable means of information for evidence-based decision-making”.
    • What is Open Science?

The opening up of scientific knowledge began in the 17th century, when the first academic journals were created. These journals provided access to knowledge for society, and enabled different scientific groups to share their resources and conduct their work collaboratively.

      Open Science can be seen as a movement in this continuity of access to scientific knowledge. It seeks to make scientific research and the data it produces accessible to all and at all levels of society.

      Open Science represents a novel approach to scientific development, based on cooperative work and information distribution through networks using advanced technologies and collaborative tools. Open Science seeks to facilitate knowledge acquisition through collaborative networks and encourage the generation of solutions based on openness and sharing.
Source: European Commission, "Study on Open Science: Impact, Implications and Policy Options" by Jamil Salmi, August 2015
• Open access to publications may immediately come to mind when you think about Open Science. However, Open Science is about more than Open Access: it also includes MOOCs, open source software, citizen science, open peer review and Open Data.


    • Open Science initiatives in Europe

      Since 2016, the European Commission has organised its Open Science policy according to eight ambitions: 

      • Open data,
      • European Open Science Cloud (EOSC),
      • Next Generation Metrics,
      • Future of scholarly communication,
      • Rewards, 
      • Research integrity, 
      • Education and skills, 
      • Citizen science.
    • Examples of OS initiatives

      Many initiatives are underway in Europe. Here are some of them.

• Why is Research Data Management strongly related to Open Science?

      One of the major challenges for Open Science therefore concerns the opening up of data. But to be effective, data opening must go hand in hand with good data management. In order to be reusable, research data must indeed be rigorously processed (e.g. it must be well documented, described by metadata and recorded in open formats).

      There is no simple definition for Research Data Management because it depends on many factors such as the specificity of the project, type of data and others. However, the definition below makes it quite clear what Research Data Management is.

      Research data management (or RDM) is a term that describes the organization, storage, preservation, and sharing of data collected and used in a research project. It involves the everyday management of research data during the lifetime of a research project (for example, using consistent file naming conventions). It also involves decisions about how data will be preserved and shared after the project is completed (for example, depositing the data in a repository for long-term archiving and access).
      Source: https://pitt.libguides.com/managedata

      And as stated in the Guidelines on FAIR Data Management in Horizon 2020, "Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process".
    • Benefits of RDM and sharing

      The benefits of good data management and openness are numerous! In a few points:
      • New requirements and opportunities for researchers 
        • Researchers can better promote their research and be cited, as the data enter the scientific publishing process (data repository, publication of data papers). 
        • Data sharing may be a condition for obtaining funding for scientific projects or for the publication of an article. 
      • New perspectives for science 
        • Making data available offers a better guarantee against scientific fraud. 
        • Sharing data requires the adoption of good data management practices (describing data, documenting them, making them sustainable, etc.), which improves the quality of research work. 
        • The cost of creating, collecting and processing data can be very high. Reusing existing data rather than recreating makes research profitable, accelerates innovation and the return on investment in Research and Development. 
• The creation of databases allows data mining (Text and Data Mining), extraction, cross-checking and the construction of visualizations. These new processes make it easier to launch new research initiatives and foster their interdisciplinary nature. 
        • The deluge of digital data (Big Data) is having an impact on the way scientific research is carried out. We talk about Data Driven Science, an approach that automates discoveries by harnessing the power of computers to find correlations among large amounts of data. 
      • A better use of public money and a return for society 
        • Publicly funded research must be open to all. Opening up data makes research more transparent, builds citizens' trust and enables them to get involved (e.g. in citizen science). 
        • The data generated by Open Data and Big Data provide a field for scientific research, which in turn can inform society about its most recent developments.

    • Badge OBERRED Research Data Management Context

      Test your knowledge on this first part of the introduction to data management and sharing.

Success in this test is rewarded by an Open Badge! To pass this test, you must be enrolled in the course.

    • Please answer these 5 questions related to the lesson to test your knowledge.

    • You can find your badge in your profile, in the badge section.


      👏 Congratulations👏

You have successfully completed assessment 1 and obtained this open badge: Open Badge Context (RDM)

      See the badge

    • Congratulations, you have completed lesson 1!

      Thank you for taking this first lesson. You can get your Open Badge at the bottom of this page.

      But first, we invite you to respond to this survey about Lesson 1 "Context and stakes of Research Data Management (RDM)".

      It will only take you a few minutes to answer these 5 questions and will help us to improve the MOOC. This survey is anonymous.
    • 1. Definition


      There are many definitions of research data. The OECD (Organisation for Economic Co-operation and Development) definition is the most commonly used:

      "Research data" are defined as factual records (numerical scores, textual records, images and sound) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings
      Source : OECD, OECD Principles and Guidelines for Access to Research Data from Public Funding, Paris, 2007.

As an example, the University of Leeds describes research data as:

"Any information that has been collected, observed, generated or created to validate original research findings. Although usually digital, research data also includes non-digital formats such as laboratory notebooks and diaries."
      Source : University of Leeds Library.

In the context of open science, these definitions can be broadened to encompass all the data produced by researchers, which can also allow other researchers to conduct new research projects.

    • 2. Diversity of research data

      Depending on the project, the research data may be:
      • produced or collected: these are the data created, elaborated, generated during research activities (observations, measurements, etc.)
      • pre-existing: these are already existing data (corpus, archives...) which are used for the project. The data used may initially have been collected in a context other than the research, but they are used as research data within the framework of the project.
These data can be qualitative (interview data, observation data, open-ended questionnaires, etc.) or quantitative (measurement tables, scored evaluation questionnaires, thermometer readings, etc.). Depending on the context in which they were created (capture or production), their exploitation, analysis and processing, research data may be of different kinds, contained in various media and of all types.

      There are several descriptive classifications. One of these is:
      • the source of the data 
      • the form of the data 

    • What is the source of the data?


      Observation data

      Observation data are captured in real time. They are captured by observing a behaviour or activity and are therefore most often unique and impossible to reproduce. This is the case with sensor data, neuroimaging, astronomical photography or survey data.


      Experimental data

      Experimental data are obtained from laboratory equipment. They are often reproducible but this can be costly. Chromatographs and DNA chips fall into this category.


      Computational or simulation data

Computational or simulation data are generated by computer or simulation models. They come with more extensive metadata. They are often reproducible provided that the model is properly documented. For simulation data, the model used is often as important as the data generated from the simulation, and sometimes even more so. Examples include meteorological models, seismic simulation models and economic models.


      Derived or compiled data

      Derived or compiled data are derived from the processing or combination of raw data. They are often reproducible but expensive. This is the case for data obtained by text mining, 3D models or compiled databases.


      Reference data

      Collection or accumulation of small datasets that have been peer reviewed, annotated and made available.

    • What form does this data take?

Textual data: field or laboratory notes, survey responses... 

Numerical data: tables, measurements...

Audiovisual data: images, sounds, videos…

Computer codes

Discipline-specific data: for example FITS for space science data or CIF in crystallography...

Specific data produced by some instruments


    • 3. Why manage and share your data

• Quantity: good management is necessary because of big data and especially to avoid data loss.
• Quality: sharing data requires good data management practices, which improves the quality of research work.
• Validation of research results: sharing data contributes to validating research results. More and more publishers ask researchers to make available all underlying data mentioned in the submitted article.
• Integrity: making data available ensures better security against scientific fraud.
• Valorisation: data sharing allows researchers to enhance the value of their data and increase their visibility (citations).
• Funding: data sharing (based on the principle of "as open as possible, as closed as necessary") may be a condition for project funding.
• Reproducibility and reuse: the cost of creating, collecting and processing data can be very high. Reusing existing data rather than recreating them reduces the time and cost of research.
• Interdisciplinarity: databases allow better search, extraction, cross-referencing and visualization of data, particularly across different disciplines.
• Exhumation of "fossilized" data: publications provide access to about 10% of the data. The remaining 90% stays on computer hard drives and is not used; these are called "fossilized data". Proper management and sharing of this data would prevent the loss of unique data.
• Patrimonial value: some research data can have scientific heritage value. It is particularly important to organize good management and sharing of these data.
    • 4. Research Data Life Cycle

The data life cycle is the set of steps involved in the management, preservation and dissemination of research data associated with research activities. This cycle guides researchers through the research data management process to enable them and their stakeholders to make the most of the research data generated.
      Source : DoRANum

      It can be divided into six different phases: Planning, Collecting, Analysing, Publishing, Preserving, and Reusing.

Data management cycle: planning, collection, processing, preservation, reuse, publication. Promotes access.


      Source : Adaptation of Research data lifecycle – UK Data Service
    • 1. Definition

      It’s recommended to respect the 4 FAIR principles in order to ensure an optimal use of research data and associated metadata, both by people and by machines.

      Findable, Accessible, Interoperable, Reusable
By SangyaPundir, own work, CC BY-SA 4.0

      FAIR data are those that are Findable, Accessible, Interoperable and Reusable.
What does each of these terms mean in a practical sense, and how can you tell if your own research data are FAIR?
    • 2. Explaining the FAIR Principles

FAIR data are data that satisfy each of the four principles below:

      Principle F is implemented through the use of persistent identifiers (for example: DOI), rich metadata, by listing in catalogs, in repositories...

      Principle A means implementing long-term storage of data and metadata, with facilitated access and/or download (standardized and open communication protocols), and specification of access and use conditions.

      Principle I means that the data is downloadable, usable, intelligible and combinable with other data, by humans and machines, through the use of standard formats, vocabularies and ontologies.

      Principle R relies on characteristics that make data reusable for future research or other purposes (teaching, innovation, reproduction/scientific transparency). This is made possible by a rich description that specifies the data provenance, the use of community standards, and the addition of licenses.


      The gradual adoption of these FAIR principles will make data easier to share and reusable by both humans and computer systems.


      Examples of implementation of FAIR principles

      Many recommended actions for the management and sharing of research data are fully or partially compliant with FAIR. Some examples are:

      As a researcher in Social and Human Sciences: I securely save and store my data throughout the project using SHAREDOCS and Huma-Num Box.
      I work in the field of ecology: I document the metadata associated with my data according to the EML (Ecological Metadata Language) standard.
      I organize and name my files in the same way as all project partners.
      As an archaeologist: I use a disciplinary controlled vocabulary, the PACTOLS thesaurus.

      I apply an Etalab license to my datasets.
      I deposit my genomics data in the GenBank data repository.
      My datasets are uniquely and persistently identified by a DOI.
      I communicate my source codes.
      I make my files available in .csv rather than .xls, that is, in an open and non-proprietary format.
      As an ethnologist: during my research project, I conducted interviews that have significant heritage value. I deposit my datasets in the CINES permanent archiving platform.


      These various actions contribute to making my data FAIR!
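
      One of the actions above - making files available in .csv rather than a proprietary format - can be done with nothing more than the standard library of a language such as Python. A minimal sketch (the file name and measurement values are invented):

          import csv

          # Invented example measurements; in practice these would come from your instruments.
          rows = [
              {"sample_id": "S1", "temperature_c": 21.4, "ph": 7.2},
              {"sample_id": "S2", "temperature_c": 22.1, "ph": 6.9},
          ]

          # Writing to .csv keeps the data in an open, non-proprietary, machine-readable format.
          with open("measurements.csv", "w", newline="", encoding="utf-8") as f:
              writer = csv.DictWriter(f, fieldnames=["sample_id", "temperature_c", "ph"])
              writer.writeheader()
              writer.writerows(rows)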

    • 3. Test your data

      Test your data with this checklist created by Sarah Jones & Marjan Grootveld, EUDAT (2017):

    • 4. Play with FAIR principles

How do you think the FAIR principles benefit the researcher and the scientific community?

      Instructions: A researcher has produced data within a research project in accordance with FAIR principles. This offers immediate benefits within the framework of their project and career, but can also benefit the scientific community later. Place each card on one of the two zones identified "For the Researcher" and "For the Scientific Community".

• Here is a video that explains what a DMP contains and when it should be written:

    • 1. Definition

DMP stands for Data Management Plan. It is a management tool, structured in sections, which summarizes the description and evolution of the datasets produced during your research project. It helps to prepare the sharing, reuse and long-term preservation of data.
Source: DoRANum

Circular diagram describing the FAIR stages of data management: planning, collection, analysis, sharing.

      Adaptation of Research data lifecycle – UK Data Service

    • 2. DMP: an essential document

      The DMP has become a widespread management tool. It is increasingly recommended or required worldwide.

      In Europe, projects funded by the European Commission are required to deliver a data management plan: the initial version of the DMP is included among the deliverables six months after the start of the project (Horizon 2020 models, ERC-European Research Council).

      To promote the management and sharing of research data, a lot of European initiatives have been deployed, including tools and infrastructures (e.g. the Zenodo repository, the OpenAIRE infrastructure, etc.). In the same way, many institutes, organizations and institutions propose institutional DMP models available to their communities.
    • 3. Actors and contributors

The researcher does not write the DMP alone. The DMP is an opportunity for collaboration with the various stakeholders of the project: scientists, IT specialists, data librarians, project managers, lawyers... Data management requires a collective effort!


Images designed by Freepik, macrovector / Freepik and makyzz / Freepik

      Moreover, universities, infrastructures and research organizations often issue recommendations to their research communities.

Funders (such as the European Commission or Science Europe) and certain publishers can give precise recommendations (for example, the obligation to write a DMP within 6 months of the start of the project for projects funded by Science Europe) or offer advice (for example, the European Commission points to the Zenodo repository in its Horizon 2020 guide).


    • 4. DMP: a project management tool

It is an evolving, dynamic and continuously updated document (introduction of a new dataset, patent deposit, changes in the consortium...). It is also a project management tool that facilitates the organization and description of data. It makes it possible to define responsibilities and resources and to produce FAIR data.


      Data Organization
      The DMP helps to organize data well throughout the project.

      Evolving Document

      Start drafting the DMP from the beginning of the project, with already known or planned elements. Then complete the DMP progressively. Plan for at least 2 versions: at the beginning and end of the project. For projects longer than 30 months, an intermediate version is required.

      Data Description
      In the DMP, describe how data will be obtained, processed, organized, stored, secured, preserved, shared... (data lifecycle).

      Responsibilities
      In the DMP, designate the person(s) responsible for data management for all project stages and within the partnership if applicable: data entry; metadata production; data quality control; data storage, sharing and archiving; DMP updating. Individuals can be named specifically or a function can be indicated if the person occupying it might change during the project.

      Resources
      Evaluate the necessary resources (budget, allocated time, personnel) to implement the actions described in the DMP: Time needed to prepare data for storage, sharing and archiving; equipment costs and personnel remuneration; storage costs (dedicated servers, processing, maintenance, security, access...), sharing costs (website, publication...) and data archiving expenses.

      Reliable Data
      The DMP allows data producers to ask themselves the right questions and thus improve the reliability of their data.

The DMP helps to initiate collective work on good practices very early on and to anticipate questions related to data management (such as the choice of a repository or how to document the data).


    • 5. DMP structuration

Many DMP models have been created by organizations, institutes and funders for their users, in order to respond to specific features or local contexts. Nevertheless, these models contain the same elements, in line with the data life cycle:

• Administrative information
      • Description of data
      • Documentation, metadata, standards
      • Data storage during the project
      • Sharing data in a repository
      • Persistent archiving
      • Data security
      • Legal and ethical aspects
      • Responsibilities
      • Costs
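
      As an illustrative, non-normative sketch, these common sections can also be captured in a simple machine-readable template before the content is transferred to a dedicated tool such as DMPonline. The Python dictionary below simply mirrors the list above; the keys and placeholder values are our own, not an official DMP format:

          # A minimal, illustrative DMP skeleton mirroring the common sections listed above.
          dmp_template = {
              "administrative_information": {"project_name": "", "funder": "", "contact": ""},
              "data_description": [],  # one entry per dataset
              "documentation_metadata_standards": "",
              "storage_during_project": "",
              "sharing_in_a_repository": "",
              "persistent_archiving": "",
              "data_security": "",
              "legal_and_ethical_aspects": "",
              "responsibilities": {},
              "costs": "",
          }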

      Consult the “data management checklist” created by ETH-Bibliothek; Bibliothèque EPFL :

      Source : https://zenodo.org/record/3332363#.Xz44rDaP5aQ
    • 6. Data management planning tools – DMPonline

    •  7. In brief

      To conclude, see the summary sheet on the DMP proposed by DataOne:




    • 1. Definition


Metadata allow a more accurate description of the data: they are data about data.
If we imagine a dataset as a can, then the metadata are the label that describes the contents of the can (date of production, creator, etc.).
      Source : Doranum - Parcours interactif sur la gestion des données de la recherche

Without metadata
Cans without labels

Source: picture by Araceli Jáuregui from Pixabay

With metadata
Cans with different labels, sorted by type of can

Source: picture by heberhard from Pixabay

    • 2. Metadata in the data life cycle

      It is recommended to complete the metadata as the project progresses, with particular attention to:

      • the step of data sharing,
      • the step of persistent archiving (specific metadata will have to be added).

Circular diagram of the FAIR data cycle: Planning Research, Collecting Data, Processing & Analysing, Publishing & Sharing

      Adaptation of Research data lifecycle – UK Data Service

    • 3. Embedded metadata vs enriched metadata

      There are two types of metadata: embedded and enriched metadata.


      Embedded Metadata
       They are automatically produced by devices (cameras, sound recorders, measurement instruments...). This is typically the case for smartphone photos or videos. Examples of generated metadata: GPS data, device type, date, technical calibration, etc.
      Enriched Metadata
      They are added by the author. Examples: keywords, subject, author, laboratory or organization, project name, license, etc.



      Don't forget to complete the embedded metadata with enriched metadata. Ideally, this metadata should be filled in as you go along. It is recommended to use disciplinary controlled vocabularies (ontologies, lexicons, thesaurus...). This will increase the ability of the data to be combined with other data.

      For example:

      • Drugs Codex
      • Taxonomic classifications
      • IUPAC nomenclature of chemistry

      To organize the metadata it is recommended to use a metadata standard specific to your discipline or adapted to your needs. If none exists, a metadata schema will need to be created.
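
      As an illustration of the two types of metadata, here is a sketch in Python that reads the metadata embedded in a photo and completes it with enriched metadata added by the author. It assumes the Pillow imaging library is installed; the file name and all enriched values are invented:

          from PIL import Image, ExifTags

          # Read the metadata embedded automatically by the device (hypothetical file name).
          image = Image.open("photo.jpg")
          embedded = {
              ExifTags.TAGS.get(tag_id, tag_id): value
              for tag_id, value in image.getexif().items()
          }

          # Complete it with enriched metadata added by the author (invented example values).
          enriched = {
              "title": "Field survey photo, site A",
              "creator": "Jane Doe",
              "keywords": ["ecology", "field survey"],
              "license": "CC BY 4.0",
          }

          record = {"embedded": embedded, "enriched": enriched}
          print(record)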


    • 4. Difference between metadata schema and metadata standard

      • Metadata schema: it is the organization of metadata according to a model designed and created specifically for the needs of a project. This structuring is therefore unique and personalized!
      • Metadata standard: a standard is a schema that has been adopted as a model by a set of users: it is recognized, standardized and widely used.

To find the standards used in your discipline, you can ask your research collaborators, computer scientists or data librarians and see what the practices are in your field.

      Several directories and sites can be consulted:

      Don't forget to consult the information on metadata standards provided by data repositories.


    • 5. Examples of metadata standards


Dublin Core: an interdisciplinary standard for describing digital resources.

DataCite: standard linked to the attribution of persistent DOI identifiers. https://schema.datacite.org/

DDI (Data Documentation Initiative): disciplinary standard for the domain of social, behavioral, and economic sciences.

CSMD (Core Scientific Metadata Model): metadata standard for structural sciences domains (chemistry, materials science, earth sciences, biochemistry). http://icatproject-contrib.github.io/CSMD/

Darwin Core: disciplinary standard in the biodiversity domain. http://rs.tdwg.org/dwc/

EML (Ecological Metadata Language): disciplinary standard in the ecology domain; it was largely designed to describe digital resources but can also be used to describe non-digital resources such as paper maps or other media. https://eml.ecoinformatics.org/

MIDAS Heritage: disciplinary standard in the architecture and heritage domain. https://historicengland.org.uk/images-books/publications/midas-heritage/

ISO 19115: international standard for describing geographic information and services. https://www.iso.org/standard/53798.html

    • 6. Focus on the international Dublin Core standard

The Dublin Core is a widely used international and multidisciplinary standard. Moreover, it is often the basis of disciplinary or specific metadata standards. It contains 15 elements that constitute the minimum required:

      • elements related to content
      • elements related to intellectual property. 


      Dublin Core Metadata Element Set

      "The original DCMES Version 1.1 consists of 15 metadata elements, defined this way in the original specification:

        1. Contributor – An entity responsible for making contributions to the resource
        2. Coverage – The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant
        3. Creator – An entity primarily responsible for making the resource
        4. Date – A point or period of time associated with an event in the lifecycle of the resource
        5. Description – An account of the resource
        6. Format – The file format, physical medium, or dimensions of the resource
        7. Identifier – An unambiguous reference to the resource within a given context
        8. Language – A language of the resource
        9. Publisher – An entity responsible for making the resource available
        10. Relation – A related resource
        11. Rights – Information about rights held in and over the resource
        12. Source – A related resource from which the described resource is derived
        13. Subject – The topic of the resource
        14. Title – A name given to the resource
        15. Type – The nature or genre of the resource.”

      Source: https://en.wikipedia.org/wiki/Dublin_Core

      "The fifteen basic elements are considered as a common denominator and in most cases are not sufficiently precise. The basic elements have been extended (or specified) by a set of other terms called "qualifiers".

      Two classes of qualifiers are recognized:

      • element refinements that explain the meaning of an element;
      • encoding schemes or controlled vocabularies."

      Source: Extract translated from " Présentation des standards: le Dublin Core" by Elizabeth CHERHAL (Cellule MathDoc, UMS5638, CNRS/Université Joseph Fourier, Grenoble) - https://www.enssib.fr/bibliotheque-numerique/documents/1236-presentation-des-standards-le-dublin-core-dc.pdf  


Dublin Core mind map: the 15 standard elements (ISO 15836-2003) and their qualifiers
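
      To make the 15 elements more tangible, here is a sketch of a Dublin Core record for a hypothetical dataset, built with Python's standard library. The namespace is the official DCMES one; all the descriptive values (title, creator, identifier, etc.) are invented placeholders:

          import xml.etree.ElementTree as ET

          DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core Metadata Element Set namespace
          ET.register_namespace("dc", DC_NS)

          # Invented example values for a hypothetical dataset.
          fields = {
              "title": "Soil moisture measurements, site A, 2020",
              "creator": "Jane Doe",
              "subject": "soil moisture; ecology",
              "description": "Hourly soil moisture readings collected with field sensors.",
              "publisher": "Example University",
              "date": "2020-12-31",
              "type": "Dataset",
              "format": "text/csv",
              "identifier": "https://doi.org/10.xxxx/example",  # placeholder identifier
              "language": "en",
              "rights": "CC BY 4.0",
          }

          record = ET.Element("record")
          for name, value in fields.items():
              element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
              element.text = value

          print(ET.tostring(record, encoding="unicode"))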

    • 7. Focus on DataCite metadata schema


      DataCite : Find, access, and reuse data

      Content from DataCite Metadata Working Group. (2019). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.3. DataCite e.V. https://doi.org/10.14454/7xq3-zf69


      The Metadata Schema 
      DataCite’s Metadata Schema has been expanded with each new version. It is, nevertheless, intended to be generic to the broadest range of research datasets, rather than customized to the needs of any particular discipline.

       
      DataCite Metadata Properties
      There are three different levels of obligation for the metadata properties: 
      • Mandatory (M) properties must be provided, 
• Recommended (R) properties are optional, but strongly recommended for interoperability, and 
• Optional (O) properties are optional and provide richer description. 
      Researchers who wish to enhance the prospects that their metadata will be found, cited and linked to original research are strongly encouraged to submit the Recommended as well as Mandatory set of properties. The properties listed in Table 1 have the obligation level Mandatory, and must be supplied when submitting DataCite metadata.

      Table 1: DataCite Mandatory Properties
      ID Property Obligation
      1 Identifier (with mandatory type sub-property) M
      2 Creator (with optional given name, family name, name identifier and affiliation sub-properties) M
      3 Title (with optional type sub-properties) M
      4 Publisher M
      5 Publication year M
10 Resource type (with mandatory general type description sub-property) M

      The properties listed in Table 2 have one of the obligation levels Recommended or Optional, and may be supplied when submitting DataCite metadata.


      Table 2: DataCite Recommended and Optional Properties 
      ID Property Obligation
      6 Subject (with scheme sub-property) R
      7 Contributor (with optional given name, family name, name identifier and affiliation sub-properties) R
      8 Date (with type sub-property) R
      9 Language O
      11 AlternateIdentifier (with type sub-property) O
      12 RelatedIdentifier (with type and relation type sub-properties) R
      13 Size O
      14 Format O
      15 Version O
      16 Rights O
      17 Description (with type sub-property) R
      18 GeoLocation (with point, box and polygon sub-properties) R
      19 FundingReference (with name, identifier, and award related sub-properties) O

      DataCite Properties
      Table 3 provides a detailed description of the mandatory properties, which must be supplied with any initial metadata submission to DataCite, together with their sub-properties. [...] The third column, Occurrence (Occ), indicates cardinality/quantity constraints for the properties as follows:
      • 0-n = optional and repeatable 
      • 0-1 = optional, but not repeatable 
      • 1-n = required and repeatable 
      • 1 = required, but not repeatable
      Table 3 
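
      As an informal illustration of Table 1, here is a sketch of the mandatory properties for a hypothetical dataset, expressed as a simple Python dictionary. The values, including the DOI, are invented placeholders; a real submission would follow the official DataCite XML or JSON serialization:

          # Illustrative record covering only the mandatory DataCite properties of Table 1.
          datacite_mandatory = {
              "identifier": {"identifier": "10.xxxx/example-dataset", "identifierType": "DOI"},
              "creators": [
                  {"name": "Doe, Jane", "givenName": "Jane", "familyName": "Doe"}
              ],
              "titles": [{"title": "Soil moisture measurements, site A, 2020"}],
              "publisher": "Example University",
              "publicationYear": "2020",
              "resourceType": {"resourceTypeGeneral": "Dataset", "resourceType": "Dataset"},
          }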
    • 8. Focus on an example of a metadata model in the environmental domain

      "Here is a model you can use by choosing the metadata fields suitable to your context, the repository where you upload your data, and which will convey minimum and sufficient information to help others understand and reproduce your data." * The required fields about protocols can be repeated if several protocols have been implemented consecutively (e.g. sampling, sample preparation, measurements, data processing, etc.)

       Source: Extract of "Guide of Good practices - Research data management and promotion" by ARNOULD Pierre-Yves (OTELo), JACQUEMOT-PERBAL Marie-Christine (Inist-CNRS)
    • 9. In brief

      Metadata are useful for:

• Understanding the origin of the data
• Understanding the context in which the data were created or collected
• Improving harvesting by machines (search engines)
• Ensuring interoperability
• Knowing the conditions for reusing and sharing data
• Accessing useful information when data are not shared or have been destroyed.

      To conclude, see the summary sheet on metadata proposed by DataOne:


       Source: https://dataoneorg.github.io/Education//lessons/07_metadata/L07_DefiningMetadata_Handout.pdf
    • Persistent identifiers are assigned to the data at the sharing step.

Diagram of the stages of the FAIR data cycle: collection, processing, publication, preservation and reuse of data

      Source: Adaptation of Research data lifecycle – UK Data Service


    • 1. Definition


An identifier is a unique association between an alphanumeric code and an entity or a resource. On the web, resources are located by URLs. However, these URLs are not stable: if the resource is moved and/or renamed, it is no longer accessible and the browser displays a 404 error code. Persistent identifiers guarantee a stable link to the online resource. Persistence is obtained through active management of URLs.
This management is ensured by recognized organizations, supported by human and technical infrastructures. The identity of the resource is matched to its location on the web, so the hypertext link is guaranteed and will never be broken.
The role of persistent identifiers is to make it easier to track, locate, access and cite the outputs of research:
• Persistent identifiers allow reliable identification (of a resource, an author...).
• Persistent identifiers for publications and data allow access to them over the long term.
• They link published articles to the underlying datasets.
• They also help to discover, share, reuse and cite the results of research and scientific production.

      Diagram of the main elements of the data management cycle: publication, sharing, verification, citation
      Source: DoRANum - Persistent identifiers

      The ideal identification is a combination of several identifiers:

      • PID for publications
      • PID for data
      • PID for authors
      • PID for research organizations
      Diagram of the publication and access cycle for scientific data: research organization, authors, publication and data
      Source: DoRANum - Persistent identifiers: summary sheet

      For publications, assigning a persistent identifier is a well-established and systematic procedure. Most publishers and open archives automatically assign a persistent identifier to each article, most often a Handle or a DOI. The latter is assigned through the Crossref agency.

      Persistent identifier system for scientific publications: publisher, open archive, repository and assignment tools.


      Source: DoRANum - Persistent identifiers: an overview

      Identifiers are often assigned to your data when they are deposited in a repository: this can be a local identifier or a globally unique identifier.

      In this course, we will not talk about PID for publications.

    • 2. PID for data

      It is recommended to assign a persistent identifier to each dataset.

      Persistent identifiers for data are assigned to resources resulting from scientific production, for example datasets, images, sounds, physical objects... 

      With the deployment of the Internet and the online availability of research data, identifiers better adapted to the digital world have been put in place such as:

      • DOI (Digital Object Identifier)
      • Handle
      • PURL (persistent URL)
      • ARK (Archive Resource Key)
      • ePIC (European Persistent Identifier Consortium)…


      Focus on DOI: https://datacite.org/index.html
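As a small illustration (the DOI below is a placeholder, and the requests package is assumed to be installed): a DOI is resolved through the https://doi.org/ proxy, which redirects to the current landing page of the resource.

```python
import requests

# Placeholder DOI for illustration only; replace it with a real DOI.
doi = "10.1234/example-doi"

# The doi.org resolver redirects a DOI to the current URL of the resource,
# which is what keeps the identifier usable even if the URL changes.
response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)

print("Resolved URL:", response.url)          # final landing page after redirects
print("HTTP status:", response.status_code)   # 404 here, since the DOI is a placeholder
```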

    • 3. PID for authors

      Having an author identifier allows a researcher:

      • to be linked to their scientific output
      • to be clearly identified and cited.

      The most widely used is ORCID, an international, neutral and independent identifier.

      There are also several types of identifiers dedicated to authors and contributors involved in research:

      • Commercial publishers assign local identifiers for their databases: for example Clarivate with the ResearcherID identifier or Elsevier with the Scopus Author ID,
      • social networks such as ResearchGate and Academia.edu assign each registrant their own identifier,
      • open archives can offer the creation of a local identifier, for example the arXiv author ID for the arXiv open archive,
      • in the world of libraries, the ISNI (International Standard Name Identifier) is an international identifier attributed to persons and institutions involved in literary, artistic and intellectual production in the broadest sense. It is defined by an ISO standard.

      Focus on ORCID: https://orcid.org/
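As a hedged sketch of how an ORCID iD can be used programmatically (the iD below is a placeholder; the endpoint and field names follow the ORCID v3.0 public API and should be checked against the current documentation):

```python
import requests

# Placeholder ORCID iD; replace it with a real identifier.
orcid_id = "0000-0000-0000-0000"

# Public ORCID API (v3.0): request the public record as JSON.
url = f"https://pub.orcid.org/v3.0/{orcid_id}/record"
response = requests.get(url, headers={"Accept": "application/json"}, timeout=10)
response.raise_for_status()
record = response.json()

# The "person" section of the record contains the researcher's public name.
name = record["person"]["name"]
print(name["given-names"]["value"], name["family-name"]["value"])
```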

    • 4. PID for research organizations

      There are persistent identifiers not only for authors but also for research organizations:

      • they make it possible to link an author to their organization,
      • they help research organizations identify all the scientific output of their researchers.

      Focus on ROR: https://ror.org/
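For example, a research organization can be looked up in the ROR registry through its public API; the sketch below assumes the v1 endpoint https://api.ror.org/organizations and is meant as an illustration only.

```python
import requests

# Query the public ROR API (assumed v1 endpoint) for organizations matching a name.
response = requests.get(
    "https://api.ror.org/organizations",
    params={"query": "Centre National de la Recherche Scientifique"},
    timeout=10,
)
response.raise_for_status()
results = response.json()

# Each matching organization carries a persistent ROR ID (a URL such as https://ror.org/XXXXXXXXX).
for org in results.get("items", [])[:3]:
    print(org["id"], "-", org["name"])
```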

    • 5. Play with PID

      Instructions: Place each card on one of the four zones identified "Author", "Data", "Research organization" and "Publication". Some cards appear twice.




    • Storage and secure backup, sharing, and long-term archiving occur at different steps of the data lifecycle and have distinct functions.
      Here is a diagram to help understand the difference between these three steps:

      Data flow diagram: secure storage, sharing in a repository and archiving, with interactions between research teams.

      Source: DoRANum - Storage, sharing and archiving: what are the differences?

    • 1. Storage and secure data backup during the project

      The first step is the storage and secure backup of the data throughout the project:

      Secure backup: secure server (organization), collaborative workspace > the research team, the researcher.

      The objectives are to:

      • ensure data security
      • facilitate access for all project collaborators

      Storage and secure data backup in the data lifecycle

      This concerns the first part of the data life cycle.

      First step, secure backup: collecting data, processing & analysing data
      Adaptation of Research data lifecycle – UK Data Service

    • Secure data backup measures to be implemented

      Efficient backup means duplicating and storing data in different locations on different media in a time frame relevant to the project.

      The best approach is to apply the 3-2-1 rule, which means:

      1. keep 3 copies of the data,
      2. use 2 distinct storage media or technologies,
      3. keep 1 copy off-site.

      In any case, it is necessary to organize and plan these backups, taking care to manage versions. At each step of the project, select the data to be backed up or deleted. The successive states of the data are kept in line with the different processing steps, making it possible to return to a previous version if necessary.

      This also requires choosing a hosting and backup policy adapted to the needs of the project and to the specific characteristics of the data (for example sensitive data or large volumes). This can rely on local servers (virtual machines), an institutional cloud with secure access, etc.
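As a purely illustrative sketch of the 3-2-1 rule described above (the paths are placeholders, and a real project would normally rely on the institution's backup services rather than an ad hoc script):

```python
import shutil
from datetime import datetime
from pathlib import Path

# Placeholder locations: the working copy plus two backup destinations,
# ideally on two different media, one of them off-site (e.g. a mounted remote share).
data_dir = Path("project_data")                       # copy 1: working data
backup_targets = [
    Path("/mnt/local_backup/project_data"),           # copy 2: second medium
    Path("/mnt/offsite_backup/project_data"),         # copy 3: off-site location
]

# Timestamped snapshots keep earlier states of the data, so a previous
# version can be restored if a processing step needs to be rolled back.
stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
for target in backup_targets:
    snapshot = target / f"snapshot_{stamp}"
    shutil.copytree(data_dir, snapshot)
    print(f"Backed up {data_dir} to {snapshot}")
```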

      To improve your knowledge about "durability of storage media", see this infographic:


      Source: von Rekowski, Thomas. (2018, October). Durability of Storage Media. Zenodo

      Folder structure and file naming

      Reliable access requires rules for organizing the folder structure and for unique, accurate naming of data files:



      File Formats

      The choice of a format can be guided by:

      • the recommendations of an institution,
      • the uses of the scientific community of the discipline,
      • the software or equipment used. 

      The ideal is to opt for file formats that are as open as possible (non-proprietary), standardized and durable, for example:

      • prefer .csv over .xls
      • prefer .odt over .doc
      • for images intended for preservation, prefer a lossless format such as .tif over lossy .jpg

      In any case, it is necessary to mention in the DMP which formats will be used.
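For tabular data, such a migration can be as simple as exporting a spreadsheet to CSV. The sketch below assumes the pandas and openpyxl packages are installed and uses placeholder file names.

```python
import pandas as pd

# Read a spreadsheet (placeholder file name); .xlsx files require the openpyxl package.
df = pd.read_excel("measurements.xlsx")

# Write the same table as CSV: an open, plain-text, widely supported format.
# UTF-8 encoding avoids character-set problems when the file is reused elsewhere.
df.to_csv("measurements.csv", index=False, encoding="utf-8")

print(df.head())  # quick check that the data were read correctly
```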

    • 2. Depositing data in a repository for sharing

      This step most often occurs after the project (although you can share your data earlier): the datasets need to be deposited in a repository.
      A repository allows data to be stored, accessed and reused.
      Sharing data in a repository provides wide access to the scientific community over the short and medium term (5 to 10 years).

      Storage for sharing: general or disciplinary repository > the research team, other research teams, another research team

      Sharing data in the data lifecycle

      Data sharing often complements scientific publication, both during and after the research project.

      Second step, storage for sharing: publishing & sharing data, preserving data
      Adaptation of Research data lifecycle – UK Data Service

    • How to prepare data according to FAIR principles

      The goal is to share the research data of the project in optimal conditions.

      All the data must be prepared according to FAIR principles, even if they are shared partially or with restricted access.
      The data must be deposited in the chosen repository with their metadata and, where applicable, the source code needed to read and understand them.

      Here is a checklist to prepare the data efficiently:

      Not all data necessarily need to be shared. The research team must select the datasets they wish to share and, for each of them, define the access modalities.

      Check the compatibility and interoperability of data formats; migrate if necessary to an appropriate format that is as open as possible.

      Prepare source code (e.g., scripts) if it is necessary to read and process the data.

      Complete and enrich the metadata according to the chosen repository: if not already done, choose a metadata standard; if no suitable standard exists, create a metadata schema; then complete the fields for each dataset, following the adopted standard.


      To improve your knowledge about "preparing your data collection for deposit", watch this video by the UK Data Service:



      How to choose a data repository

      There are different categories of repositories:

      • publisher-specific
      • discipline-specific
      • institution-specific
      • and multidisciplinary repositories.

      Most often, the repository is recommended by institutions (e.g. the French repository Data INRAE), by funders (e.g. Zenodo, recommended by the European Commission) or by the scientific community (e.g. GenBank, Pangaea, Dryad, etc.). It is sometimes imposed by a publisher (e.g. Gene Expression Omnibus).
      If there is no recommendation, choose one from a directory (e.g. re3data, OAD, OpenDOAR, FAIRsharing, etc.).
      In any case, a data librarian can help the research team to choose a relevant repository.

    • Example of a search in the re3data directory

      https://www.re3data.org/

      Filters can be used to search in this directory.





      For each repository, a short descriptive sheet presents:

      • the subject,
      • the type of content,
      • the country,
      • a short summary,
      • icons representing the criteria it meets.

      Example for the 4TU repository: https://www.re3data.org/repository/r3d100010216



      Tip: The search engine Google Dataset Search is also a simple tool for finding datasets and the repositories that host them.
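The re3data directory also offers a public API, so a list of repositories can be retrieved programmatically. The endpoint and XML element names below are assumptions based on the re3data API documentation and should be verified before use; this is a sketch, not a tested client.

```python
import requests
import xml.etree.ElementTree as ET

# Assumed re3data API endpoint returning the list of registered repositories as XML.
response = requests.get("https://www.re3data.org/api/v1/repositories", timeout=30)
response.raise_for_status()
root = ET.fromstring(response.content)

# Print the identifier and name of the first few repositories
# (element names may differ between API versions; adjust if necessary).
for repo in root.findall(".//repository")[:5]:
    print(repo.findtext("id"), "-", repo.findtext("name"))
```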

    • 3. Long-term archiving

      Long-term archiving is the final step in saving and storing research data.

      Long-term archiving: long-term archiving platform > the research team, other research teams, another research team

      Long-term archiving in the data lifecycle

      Long-term archiving generally concerns only part of the data produced by a project. For some projects, it is not necessary to archive data.

      Third step, long-term archiving: preserving data
      Adaptation of Research data lifecycle – UK Data Service


      Definition

      The question of long-term archiving only concerns data:

      • with scientific value for the whole community,
      • requiring preservation for at least 30 years.

      It is an expensive operation that requires a dedicated budget. It is the responsibility of the laboratory, not of the individual researcher.
      Concretely, long-term digital archiving consists of preserving the document and its content:

      • in its physical and intellectual aspects,
      • over the very long term,
      • so that it remains accessible and understandable.

      Long-term archiving services in Europe

      At the European level, there are several infrastructures that specifically propose long-term archiving services.
      The European Open Science Cloud (EOSC) Portal is an integrated platform that provides easy access to a wide range of services and resources for various research domains, along with integrated data analytics tools. It includes services for long-term archiving, for example:

      EGI Archive Storage: this service allows you to store large amounts of data in a secure environment, freeing up your usual online storage resources. The data on Archive Storage can be replicated across several storage sites, thanks to the adoption of interoperable open standards. The service is optimised for infrequent access. Main characteristics: stores data for long-term retention; stores large amounts of data; frees up your online storage.

      B2SAFE: this is a robust, safe and highly available EUDAT service which allows community and departmental repositories to implement data management policies on their research data across multiple administrative domains in a reliable manner. A solution to: provide an abstraction layer which virtualizes large-scale data resources, guard against data loss in long-term archiving and preservation, optimize access for users from different regions, bring data closer to powerful computers for compute-intensive analysis.


      Selection of data to be archived

      To select the data that will be archived for the long term, it is important to consider the value of the data:

      • Are the data unique, non-reproducible (or at too high a cost)? 
      • Do the data have historical value, i.e., do they represent a landmark in scientific discoveries? 
      • Do the data include changes in processing methods, new standards, or create precedents? 
      • Do the data support ongoing projects or scientific trends? 
      • Are the data likely to meet future needs/directions of the scientific community (reuse potential)? 
      • Are the data likely to be cited or referenced in a publication? 
      • ...

      • The quality and compliance of data collection must be controlled and documented. This may include processes such as calibration, sample or measurement repetition, standardized data capture, data entry validation, peer review... 
      • Quality, physical integrity of data (undamaged, readable...)

      • What is the policy of the funder, the institution? 
      • Are the data compliant with the institution's strategy?

      • Is there a legal or legislative reason to preserve the data? 
      • Is there an obvious reason why the data might be used in litigation, public inquiries, police investigations, or any report or document that could be challenged in court? 
      • Are there financial or contractual obligations that require data preservation?

      When considering data preservation, the cost of conservation (identified not only as storage, but also management, sharing, access, backup, and long-term data maintenance) must be weighed against evidence of potential data reuse.

      Consult the research archives management reference guide, Association of French Archivists, Aurore Section.



      Preparation of the data to be archived

      Here is a checklist to prepare your data for long-term archiving:

      1. Selection of datasets: The datasets (and associated metadata) selected may be different from the shared datasets.
      2. Volume: Evaluate the volume of data and the necessary budget.
      3. Data treatment: Treatment of some data may be necessary. For example, personal data requires anonymization.
      4. File formats: Check the validity of data file formats according to the recommendations of the archive selected.
      5. Software: Document and perhaps also provide the software used to access the data.
      6. Metadata: Complete and enrich the metadata if necessary, according to the recommendations of the archive selected.
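A complementary practice, not listed in the checklist above but common when preparing deposits, is to record fixity information (for example SHA-256 checksums) so that the physical integrity of the archived files can be verified later. A minimal sketch with a placeholder folder name:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder folder containing the datasets selected for long-term archiving.
dataset_dir = Path("datasets_to_archive")

# Write a manifest that can be deposited together with the data and checked later.
with open("checksums_sha256.txt", "w", encoding="utf-8") as manifest:
    for file in sorted(dataset_dir.rglob("*")):
        if file.is_file():
            manifest.write(f"{sha256_of(file)}  {file.relative_to(dataset_dir)}\n")
```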
    • 4. Play with the 3 steps of data storage

    • 1. Reuse and enhancement of data in the data lifecycle

      This is the final step in the data life cycle but also the starting point of a new cycle if the data are reused for a new research project.

      Re-using data
      Adaptation of Research data lifecycle – UK Data Service

      It is important to prepare the data for sharing in order to make them FAIR. In this way, other researchers can use them for new research projects.

    • 2. Reuse and citation of data

      On the researcher's side

      To ensure that the research data they have generated can be reused under good conditions, researchers must adopt several good practices:

      Guide: 6 steps to share your research data - FAIR principles, deposit, licence, metadata, software and DOI.


      On the user’s side

      There are several ways for a researcher to find reusable datasets:

      User guide: 4 ways to find data - searching a repository directly, directories, Google Dataset Search and data papers.

      Links:


      In all cases, re-users must respect certain rules:

      • Respect the intellectual property of the authors, as stated in the licence
      • Cite the data if the licence requires it (it is recommended to always cite your sources)
      • Link the data to the related publications.


      Tip: there are tools to help you cite a dataset correctly, such as:

      • the "Cite all versions" feature offered by the Zenodo repository,
      • the DOI Citation Formatter service, which automatically generates a complete citation from a DOI (see the sketch below).
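The DOI Citation Formatter relies on DOI content negotiation, which can also be used directly: asking the https://doi.org/ resolver for a bibliography format returns a ready-made citation. A minimal sketch with a placeholder DOI and the APA style:

```python
import requests

# Placeholder DOI; replace it with the DOI of the dataset you want to cite.
doi = "10.1234/example-doi"

# Content negotiation on doi.org returns a formatted citation instead of the landing page.
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "text/x-bibliography; style=apa"},
    timeout=10,
)

print(response.text)  # a ready-to-paste citation in APA style (for a real DOI)
```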


      To conclude, see the summary sheet on data citation proposed by DataONE:

      Source : https://dataoneorg.github.io/Education//lessons/08_citation/L08_DataCitation_Handout.pdf
    • 3. Data papers and data journals

      Writing and publishing a data paper is a good way for researchers to add value to their research data.

      A data paper is a publication that describes research datasets and associated metadata. It follows the same editorial process as traditional scientific articles:

      • Elements common to classic articles (title, abstract, keywords…)
      • Specific data elements (data types, formats, production processes and methods, metadata, reuse…)

      A data paper can be published in a data journal (journal dedicated to this type of publication) or in a classic journal that accepts data papers.

      Access to data from the data paper can be done in two ways:

      • the data are integrated in the article and published as supplementary data
      • the data are deposited in a repository, and a persistent identifier (for example a DOI) links the data paper to the data.

      An example of data papers in the domain of environment

      Two data papers were written on photographic data to study the evolution of vegetation phenology in different ecosystems across North America. The data are derived from automated digital images (taken every 30 minutes), collected via the PhenoCam network. The data are time series characterizing the color of the vegetation, including the degree of greening. The PhenoCam Explorer interface has been developed to facilitate data exploration and visualization, from which the user can also download data on a site-by-site basis. The images are also available in real time through the PhenoCam project web page.

      PhenoCam: grid of natural sites with RGB and IR images, collection status and data type for 520 locations.
    • 4. Data exposure and visualization

      In addition to depositing data in a repository, and perhaps publishing a data paper, exposing the data is another good way to add value.

      Indeed, exposing data in visual form (maps, graphs, etc.) via a platform is particularly relevant for large and complex datasets.

      Example 1

      These data (available on the ICOS Carbon Portal) come from time series of values for hundreds of parameters. With visualization tools, we can see the evolution of CO2 concentrations over a year, coupled with the origin of the air mass. This would be very difficult to grasp without data visualization.

      STILT interface: map and 2018 CO2 graph for Puy de Dôme, with visualization of fluxes and measurement stations across Europe.

      Example 2

      CoReA is a digital library created with the Omeka tool for the archaeological documentation of the Centre Camille Jullian (CNRS). It makes it easy to navigate archaeological corpora and resources.

      Search interface: 770 archaeological results with a filter menu and records of ancient inscriptions by J. Gascou

      Omeka is an open-source web publishing platform for sharing digital collections and creating media-rich online exhibits.

      From raw research data, the tool makes it possible to create curated collections that are structured, accessible and visible on the web. It is highly modular thanks to numerous plugins and handles various multimedia objects (texts, images, sounds, videos).

      The tool offers several technical advantages:

      • the interface is simple and intuitive;
      • the metadata can be harvested, which in particular enables referencing in other databases;
      • an Omeka collection can be connected to other services thanks to a REST API.
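As a hedged sketch of that last point (assuming an Omeka S installation whose REST API is exposed under /api; the base URL is a placeholder and property names may vary with the installed modules), items can be harvested as JSON:

```python
import requests

# Placeholder base URL of an Omeka S site exposing its REST API.
base_url = "https://example.org/omeka"

# Retrieve the first page of items as JSON.
response = requests.get(f"{base_url}/api/items", params={"per_page": 5}, timeout=10)
response.raise_for_status()
items = response.json()

# Print the internal identifier and title of each item
# (titles are usually stored under the Dublin Core property dcterms:title).
for item in items:
    titles = item.get("dcterms:title", [])
    title = titles[0].get("@value", "(untitled)") if titles else "(untitled)"
    print(item.get("o:id"), "-", title)
```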

      Example 3

      See the example of data visualization from the "Republic of Letters" project, where researchers map thousands of letters exchanged in the 18th century and can learn very rapidly what once took a lifetime of study to comprehend (seen in Lesson 1, Unit 2: Data and Science): https://www.youtube.com/watch?v=nw0oS-AOIPE.

    • 5. Play with data reuse and enhancement

      Instructions: Place each card on one of the two zones identified "On the researcher's side" and "On the user's side".



    • Badge OBERRED Research Data Management Context

      Test your knowledge of this part of the introduction to data management and sharing.

      Success in this test is rewarded with an Open Badge! To take this test, you must be enrolled in the course.

      Check
    • Please answer these 5 questions related to the lesson to test your knowledge.
    • You can find your badge in your profile, in the badge section.


      👏 Congratulations 👏

      You have successfully completed assessment 2 and obtained this open badge: Open Badge Processes (RDM)

      See the badge

      Not available unless: The activity Evaluation 2 is complete and passed
    • Congratulations, you have completed lesson 2!

      Thank you for taking this second lesson. You can get your Open Badge at the bottom of this page.

      But first, we invite you to respond to this survey about Lesson 2 "Concepts and processes of Research Data Management (RDM)".

      It will only take you a few minutes to answer these 5 questions and will help us to improve the MOOC.

      This survey is anonymous.


      Not available unless: The activity Evaluation 2 is complete and passed