IC_8 Catalogue, curation, provenance

Background

Short description

This implementation case aims to fulfil requirements for curation, cataloguing and provenance.
The targeted usages are:

support operations in the RIs (e.g. maintenance of an observation system network with a sensor registry)
support downstream users (e.g. discovering available datasets)
support long-term sustainability of the information (e.g. preserving datasets or software).

The catalogue is used for discovery (finding items of interest), contextualisation (determining relevance and quality) and access (connecting users, datasets, software and resources to achieve the user's end objective).
Items described in catalogues include datasets, systems and resources for observation and processing, observation events and results (e.g. samples), documents, persons and research objects.
Provenance and curation functions rely on the catalogue as a back-end repository, either as input or as output.

Provenance relates to contextualisation. It provides functions that write, update and read the catalogue, complementing discovery and access with services that determine the relevance and quality of the items described in catalogues.
Because provenance is well covered in other implementation cases (mostly IC_2, but also IC_6 and IC_9), the current implementation case will collaborate with them on requirements and fulfil them so as to demonstrate a couple of provenance functions: 2 functions related to dataset provenance, still to be listed (Barbara).

Curation relates to the data management processes required to ensure the availability of digital assets (datasets, software): media migration to ensure physical readability, redundant copies to ensure availability, appropriate security and privacy measures to ensure reliability, and appropriate catalogue content maintenance to ensure discovery, contextualisation and access to these digital assets.
The current implementation case fulfils requirements for a couple of curation functions: (a) automated media migration of datasets to ensure continued availability and readability; (b) discovery of a curated dataset along with the appropriate curated software and operating environment.
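
As an illustration of curation function (a), the following is a minimal sketch of a migration step that copies a dataset file to new storage and verifies the copy with a checksum before reporting success. The function names, the returned record fields and the demo file are assumptions made for this example; the implementation case does not prescribe any of them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_dataset(source: Path, target_dir: Path) -> dict:
    """Copy a dataset file to new storage and verify the copy.

    Returns a small record that could be written back to the catalogue.
    The field names are hypothetical.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    checksum_before = sha256_of(source)
    shutil.copy2(source, target)
    checksum_after = sha256_of(target)
    if checksum_before != checksum_after:
        target.unlink()  # do not keep a corrupted copy
        raise IOError(f"checksum mismatch while migrating {source}")
    return {"file": source.name, "location": str(target), "sha256": checksum_after}


def demo() -> dict:
    """Migrate a made-up dataset file between two temporary directories."""
    with tempfile.TemporaryDirectory() as tmp:
        old_media = Path(tmp) / "old_media"
        old_media.mkdir()
        dataset = old_media / "profile_001.csv"
        dataset.write_text("depth,temperature\n10,14.2\n")
        return migrate_dataset(dataset, Path(tmp) / "new_media")


if __name__ == "__main__":
    print(demo())
```

Storing the checksum in the catalogue record would let later migrations or audits confirm that the dataset is still readable and unchanged.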

Contact

Background	Contact Person	Organization	Contact email
RI (Use Case proposer, Agile Group leader)	Thomas Loubrieu	IFREMER	Thomas.loubrieu@ifremer.fr
RI	Keith Jeffery		Keith.Jeffery@keithgjefferyconsultants.co.uk
RI	Christian Pichot, Andre Chanzy	INRA (ANAEE)	christian.pichot@paca.inra.fr, andre.chanzy@avignon.inra.fr
ICT	Marco Rorro, Giovanni Morelli (the persons who managed CKAN for EUDAT would be perfect here)	EUDAT	M.rorro@cineca.it, g.morelli@cineca.it
RI	Damien Boulanger	IAGOS	damien.boulanger@obs-mip.fr
RI	Maggie Hellstrom	ICOS	margareta.hellstrom@nateko.lu.se
RI	Barbara Magagna, Johannes Peterseil	LTER	Barbara.magagna@umweltbundesamt.at, Johannes.peterseil@umweltbundesamt.at
Task 5.4	Zhiming Zhao, Paul Martin	UvA	z.zhao@uva.nl, P.W.Martin@uva.nl

Use case type

Implementation case

Conditions:

Each RI describes its portfolio of new and/or enhanced services that they expect from ENVRIPLUS results, derived from the ENVRIPLUS WPs;
ENVRIPLUS staff work with the RIs on these descriptions, which in the course of the project will be gradually updated with more details.

Implications:

Implementation cases are selected and adopted by interested RIs;
Both RIs and ENVRIPLUS invest in the actual implementation and associated services.

Scientific domain and communities

Scientific domain

To be relevant in the ENVRIPLUS context, the implemented functions must be validated by at least 2 RIs, preferably in 2 different spheres (bio, liquid, solid, gas):

Atmosphere: IAGOS
Biosphere: ANAEE
Geosphere: EPOS
Hydrosphere: SeaDataNet, Euro-ARGO

Community

The data acquisition community uses catalogues to describe its systems (e.g. an observation system catalogue).
The data curation community maintains the catalogues and uses them as a management tool for the inventory of datasets (and other digital objects).
The data publication community uses catalogues to parameterize its actions (e.g. minting a DOI for an object requires metadata).
The data service provision community uses catalogues to configure its services, including workflow composition to retrieve input datasets and process them, but it might not be considered a first-priority target.
The data usage community uses catalogues for discovery, together with contextualisation (including provenance) for assessing relevance and quality and for traceability. For curation, the data curation community is targeted as a provider, and the other communities are targeted as users who rely on curated information for their work.

Behavior

The connected behaviours are:

Data acquisition community:

Instrument configuration and calibration need to be registered in catalogues.
Data collection: data collected from sensors need to be curated, or at least safeguarded at an early stage on replicated storage. The description of the observation needs to be pushed to the catalogue.

Data curation community:

Data quality checking: the quality assessment performed on a dataset needs to be documented in the catalogue. Different versions of the same dataset, with different quality control applied, need to be managed.
Data preservation: the catalogue and datasets need to be preserved for the long term, with replicated copies, format maintenance and a data management plan (DMP).
Data product generation: the input and output datasets of the products need to be managed in catalogues and curated. For provenance, the description of the product processing needs to be managed in the catalogue as well.
Data replication is handled by the curation sub-use case.

Data publication community:

Data publication: the information managed in the catalogue and the curated datasets should be clean enough for publication.
Semantic harmonisation: the content of the catalogue and datasets will be syntactically homogeneous (format) and rely on harmonised thesaurus references (e.g. SKOS) to support semantic harmonisation.
Data discovery and access: data discovery will be enabled in a harmonised way. Visualization and download access will be provided when available, but not harmonised. The data access provided here will not make use of the provenance information (user tracking, profile, ...).
Data citation: the metadata required for data citation will be available in the catalogue. The DOIs or PIDs of datasets will be described in the catalogue as well. However, the function of creating a DOI is not managed by the use case.

Data service provision community:

Service description and registration: the service should be described in a catalogue; however, managing this information may not be a first priority in this use case.
Service coordination and composition can use catalogues of datasets and services to schedule and organize services. However, as said above, this is not a priority in the current use case.

Data usage community:

The behaviours of this community will be supported by the catalogue, especially for discovery, contextualisation (including provenance) and availability (through curation).
User profile management: users are recorded in the catalogue and harmonised, through matching processes, with other user directories (e.g. ORCID) managed in the use case.
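
The data citation behaviour above implies that a catalogue record can be checked for the metadata a citation needs before it is handed to a DOI minting service. The sketch below shows one way such a check might look. The field names are assumptions loosely modelled on the mandatory properties of the DataCite metadata schema, and the sample records, including the identifier, are made up.

```python
# Hypothetical field names, loosely modelled on DataCite mandatory properties.
REQUIRED_FOR_CITATION = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
)


def missing_citation_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FOR_CITATION if not record.get(name)]


# Made-up catalogue records used only to exercise the check.
complete_record = {
    "identifier": "10.1234/example",
    "creators": ["Example, A."],
    "title": "Example temperature profiles",
    "publisher": "Example RI data centre",
    "publication_year": 2016,
}
incomplete_record = {
    "identifier": "10.1234/example-2",
    "creators": [],
    "title": "Example salinity profiles",
}

if __name__ == "__main__":
    print(missing_citation_fields(complete_record))
    print(missing_citation_fields(incomplete_record))
```

A record with missing fields would be sent back to the curation community for completion rather than published.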

Detailed description

Objective and Impact

Catalogue
The catalogue aims to provide functions that cut across RIs, to edit and discover the following items:

systems for observation and processing (processing has the lowest priority)
observation events and results (e.g. samples)
datasets
documents
persons
research objects (lowest priority)

Action 1: Persons and documents will be described and federated in pre-existing e-infrastructures, to be defined (e.g. ORCID, …), so as to fulfil requirements for the provenance and curation functions.
Action 2: Dataset descriptions will be federated into a single entry point by harvesting the dataset catalogue of each RI (in whatever 'standard' metadata format it uses). The metadata format of the entry point, to be chosen among DC, DCAT, INSPIRE/ISO19115, GeoNetwork, CKAN and CERIF, is still to be defined so as to fulfil requirements for the provenance and curation functions.
Action 3: Edition and discovery functions for observation systems, events and results (including collected samples) will be implemented by a combination of RI-specific tools and federated tools (e.g. for edition), so as to fulfil requirements for the provenance and curation functions.

Challenges

The main challenge is the involvement of the RIs, from the definition of the functions to the adoption of the solution.

Detailed scenarios

In the context of the 3 actions above:

define the curation and provenance functions to be provided and identify the related requirements on the catalogue (format and access API)
define catalogue requirements for discovery and access
define the metadata profile and access API
implement the centralized or federated solution

As in Agile practice, the steps can be iterative, with a new iteration for each new requirement identified or RI supported.

Technical status and requirements

E-infrastructures that manage catalogues of persons and documents already exist, are available through standard interfaces and cut across RIs.
Catalogues of datasets are generally provided by the RIs, and their content is available through standard interfaces. Some off-the-shelf standards and tools are available to implement the catalogue of datasets (DC, DCAT, INSPIRE/ISO19115, GeoNetwork, CKAN, CERIF). ENVRIPLUS needs to federate them by using the richest available 'standard' and providing mappings to the others.
Catalogues of observation systems, events or samples may exist in the RIs. They are seldom or never accessible through standard interfaces. Some RIs lack proper tools to manage this information, which is nevertheless critical for the quality and traceability of scientific results.
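
The idea of keeping records in the richest available standard and providing mappings to the others can be sketched as a simple crosswalk table. In this example the common profile field names ("identifier", "title", "description", "keywords", "platform") are assumptions, since the ENVRIPLUS profile is not yet chosen; the target terms are the usual Dublin Core and DCAT property names.

```python
# Crosswalk from a hypothetical common profile to two target vocabularies.
CROSSWALK = {
    "identifier": {"dc": "dc:identifier", "dcat": "dct:identifier"},
    "title": {"dc": "dc:title", "dcat": "dct:title"},
    "description": {"dc": "dc:description", "dcat": "dct:description"},
    "keywords": {"dc": "dc:subject", "dcat": "dcat:keyword"},
}


def export_record(record: dict, target: str) -> dict:
    """Rename the fields of a common-profile record for a target standard.

    Fields without an equivalent in the target are dropped, which is the
    information loss to expect when mapping from a richer profile.
    """
    exported = {}
    for field, value in record.items():
        mapping = CROSSWALK.get(field)
        if mapping is None or target not in mapping:
            continue
        exported[mapping[target]] = value
    return exported


if __name__ == "__main__":
    record = {
        "title": "Example temperature profiles",
        "keywords": ["temperature", "ocean"],
        "platform": "profiling float",  # no equivalent in the crosswalk
    }
    print(export_record(record, "dc"))
    print(export_record(record, "dcat"))
```

Mapping outwards from the richest profile means each simpler standard needs only one mapping, at the cost of dropping fields it cannot express.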

Implementation plan and timetable

Documents and persons
E-infrastructures that manage catalogues of persons and documents already exist.
The implementation case will define a list of official, sustainable person and document repositories that the RIs should use to describe their resources, and will define mappings to and from the ENVRIPLUS catalogue metadata standard (once chosen).
Expected result in October 2016

Datasets
The implementation case will identify the catalogues of datasets in the RIs and analyse their machine-to-machine interfaces for harvesting purposes. A single tool will harvest them centrally. Their metadata will then need to be converted from the local RI format to that of the ENVRIPLUS central catalogue, as described above.
Expected result in October 2017
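
The conversion step after harvesting can be sketched for one possible local format. The example below assumes that an RI exposes simple Dublin Core records in XML (the `oai_dc` container used by OAI-PMH) and converts one record into a common dictionary. The sample record, the RI name and the output field names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements found inside an oai_dc record.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Made-up record standing in for one item returned by an RI catalogue.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>example-ri:dataset-42</dc:identifier>
  <dc:title>Example temperature profiles</dc:title>
  <dc:subject>temperature</dc:subject>
  <dc:subject>ocean</dc:subject>
</oai_dc:dc>"""


def dc_to_common(xml_text: str, source_ri: str) -> dict:
    """Convert one Dublin Core XML record into a common-profile dictionary."""
    root = ET.fromstring(xml_text)

    def first(tag: str):
        element = root.find(f"dc:{tag}", DC_NS)
        if element is None or not element.text:
            return None
        return element.text.strip()

    return {
        "source_ri": source_ri,  # kept so the origin of each record is traceable
        "identifier": first("identifier"),
        "title": first("title"),
        "keywords": [
            element.text.strip()
            for element in root.findall("dc:subject", DC_NS)
            if element.text
        ],
    }


if __name__ == "__main__":
    print(dc_to_common(SAMPLE, "EXAMPLE-RI"))
```

Each additional local format would get its own converter with the same output shape, so the central catalogue only ever sees one kind of record.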

Observation systems, events or samples
An integrated system will show observation systems, events and collected samples from 2 or 3 RIs in the liquid (EMSO, ARGO), solid (EPOS) and gas (ICOS) spheres.
Tools will be provided to easily edit the descriptions for RIs that do not yet have their own system.
As before, this will require mapping the metadata describing systems, events and samples at each RI to the common metadata standard of ENVRIPLUS.
Expected result in October 2018

Expected output and evaluation of output

Documents and persons
Number of RIs actually using the chosen person and document e-infrastructure to identify their resources.

Datasets
Number of RIs whose dataset descriptions are available in the federated system.
Number of users of the federated dataset catalogue (inside or outside the RIs).
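
The first dataset indicator could be computed directly from the harvested records, provided each record keeps track of the RI it came from. The sketch below assumes a hypothetical `source_ri` field on every record; the sample records are made up.

```python
from collections import Counter


def datasets_per_ri(records: list) -> Counter:
    """Count harvested dataset records per contributing RI."""
    return Counter(r["source_ri"] for r in records if r.get("source_ri"))


def count_contributing_ris(records: list) -> int:
    """Number of distinct RIs with at least one dataset in the catalogue."""
    return len(datasets_per_ri(records))


# Made-up harvested records used only to exercise the functions.
SAMPLE_RECORDS = [
    {"identifier": "a-1", "source_ri": "RI-A"},
    {"identifier": "a-2", "source_ri": "RI-A"},
    {"identifier": "b-1", "source_ri": "RI-B"},
    {"identifier": "x-1"},  # record with unknown origin is ignored
]

if __name__ == "__main__":
    print(count_contributing_ris(SAMPLE_RECORDS))
    print(datasets_per_ri(SAMPLE_RECORDS))
```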

Observation systems, events or samples
Number of observation systems whose events and results are actually available in the federated catalogue.
Number of users of the catalogues in support of the activities in the RIs.

External Links

IC_8 notebook: https://envriplus.manageprojects.com/projects/wp9-service-validation-and-deployment-1/notebooks/659
