IC 8 Catalogue, curation, provenance
The implementation case aims at fulfilling requirements for curation, cataloguing and provenance.
The targeted usages are:
- support operations in RIs (e.g. maintenance of an observation system network using a sensor registry)
- support downstream users (e.g. discover available datasets)
- support long term sustainability of the information (e.g. preserve datasets or software).
The catalogue is used for discovery (finding items of interest), contextualisation (determining relevance and quality), and access (connecting users, datasets, software and resources to achieve the user's end objective).
Items described in catalogues are among: datasets, systems and resources for observation and processing, observation events and results (e.g. samples), documents, persons, research objects.
Provenance and curation functions rely on the catalogue as a back-end repository, for input or output.
Provenance relates to contextualisation. It provides functions that write, update and read the catalogue, complementing discovery and access with services that determine the relevance and quality of the items described in catalogues.
Provenance is well covered in other implementation cases (mostly IC_2, but also IC_6 and IC_9), so the current implementation case will collaborate with them on requirements and fulfill them so as to demonstrate a couple of provenance functions: 2 functions related to dataset provenance, to be listed (Barbara).
Curation relates to the data management processes required to ensure the availability of digital assets (datasets, software): media migration to ensure physical readability, redundant copies to ensure availability, appropriate security and privacy measures to ensure reliability, and appropriate maintenance of catalogue content to ensure discovery, contextualisation and access to these digital assets.
The current implementation case fulfills requirements for a couple of curation functions: (a) automated media migration of datasets to ensure continued availability and readability; (b) discovery of a curated dataset along with the appropriate curated software and operating environment.
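Curation function (a) can be illustrated with a minimal sketch, assuming that a dataset is a single file and that "migration" means copying it to a new storage location and checking that the copy is bit-identical. The helper names and the fields of the returned record are hypothetical and are not part of any agreed ENVRIPLUS design.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_dataset(source: Path, target_dir: Path) -> dict:
    """Copy one dataset file to new storage and verify the copy (hypothetical helper)."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    before = sha256_of(source)
    shutil.copy2(source, target)
    after = sha256_of(target)
    if before != after:
        # Never keep a corrupted copy on the new media.
        target.unlink()
        raise IOError(f"checksum mismatch while migrating {source.name}")
    # Information a catalogue entry could store; the field names are assumptions.
    return {"file": source.name, "location": str(target), "sha256": after}


def demo_migration() -> dict:
    """Migrate a small temporary file between two temporary directories."""
    with tempfile.TemporaryDirectory() as tmp:
        old_media = Path(tmp) / "old_media"
        old_media.mkdir()
        sample = old_media / "dataset.csv"
        sample.write_bytes(b"time,temperature\n0,12.5\n")
        return migrate_dataset(sample, Path(tmp) / "new_media")
```

Storing the checksum in the catalogue record would let later migrations, or periodic checks of redundant copies, detect silent corruption.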
|Background||Contact Person||Organization||Contact email|
|RI (Use Case proposer, Agile Group leader)||Thomas Loubrieu||IFREMER||Thomas.email@example.com|
|RI||Christian Pichot, Andre Chanzy||INRA (ANAEE)||firstname.lastname@example.org|
|ICT||Marco Rorro, Giovanni Morelli (the persons who managed CKAN for EUDAT would be perfect here!)|| || |
|RI||Barbara Magagna, Johannes Peterseil||LTER||Barbara.email@example.com|
|Task 5.4||Zhiming Zhao, Paul Martin||UvA||firstname.lastname@example.org, P.W.Martin@uva.nl|
Use case type
- Each RI describes its portfolio of new and/or enhanced services that they expect from ENVRIPLUS results, derived from the ENVRIPLUS WPs;
- ENVRIPLUS staff work with the RIs on these descriptions, which in the course of the project will be gradually updated with more details.
- Implementation cases selected and adopted by interested RIs;
- Both RIs and ENVRIPLUS invest in the actual implementation and associated services.
Scientific domain and communities
To be relevant in the ENVRIPLUS context, the implemented functions must be validated by at least 2 RIs, preferably in 2 different spheres (bio, liquid, solid, gas):
- Atmosphere: IAGOS
- Biosphere: ANAEE
- Geosphere: EPOS
- Hydrosphere: SeaDataNet, Euro-ARGO
- data acquisition community (e.g. observation system catalogue)
- data curation community maintains the catalogues and uses them as a management tool for the inventory of datasets (and other digital objects).
- data publication community uses catalogues to parameterize its actions (e.g. minting a DOI for an object requires metadata)
- data service provision community uses catalogues to configure its services, including workflow composition to retrieve input datasets and process them, but this community might not be considered a first-priority target.
- data usage community uses catalogues for discovery, and for contextualisation (including provenance) to assess relevance and quality and to ensure traceability. The data curation community is targeted as a provider of curation, and the other communities are targeted as users who rely on curated information for their work.
The connected behaviours are:
Data acquisition community:
- Instrument configuration and calibration need to be registered in catalogues.
- Data collection: data collected from sensors needs to be curated, or at least safeguarded at an early stage on replicated storage. The description of the observation needs to be pushed to the catalogue.
Data curation community:
- Data quality checking: the quality assessment performed on a dataset needs to be documented in the catalogue. Different versions of the same dataset, with different quality control applied, need to be managed.
- Data preservation: catalogue and datasets need to be preserved for the long term, with replicated copies, format maintenance and a data management plan (DMP).
- Data product generation: input and output datasets of the products need to be managed in catalogues and curated. For provenance, the description of the product processing needs to be managed in the catalogue as well.
- Data replication is handled by the curation sub-use case.
Data publication community:
- Data publication: the information managed in the catalogue and the curated datasets should be clean enough for publication.
- Semantic harmonisation: the content of the catalogue and datasets will be syntactically homogeneous (format) and will rely on harmonized thesaurus references (e.g. SKOS) to support semantic harmonisation.
- Data discovery and access: data discovery will be enabled in a harmonized way. Visualization and download access will be provided when available, but not harmonized. The data access provided here will not make use of the provenance information (user tracking, profile, ...).
- Data citation: the metadata required for data citation will be available in the catalogue. The DOIs or PIDs of datasets will be described in the catalogue as well. However, the function of creating a DOI is not managed by the use case.
Data Service provision community:
- Service description and registration: services should be described in a catalogue; however, managing this information may not be a first priority in this use case.
- Service coordination and composition can use catalogues of datasets and services to schedule and organize services. However, as said above, this is not a priority in the current use case.
Data usage community: behaviours of the community will be supported by the catalogue, especially in the aspects of discovery, contextualisation (including provenance) and availability (through curation).
- User profile management: users are recorded in the catalogue and, through matching processes, harmonized with other user directories (e.g. ORCID) managed in the use case.
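The matching process mentioned for user profile management could look like the following minimal sketch. It assumes that catalogue users and directory entries are plain dictionaries, that an ORCID iD is the preferred key, and that the e-mail address is an acceptable fallback. All field names and sample users are hypothetical; the iD shown is used only as a placeholder value.

```python
import re
from typing import Optional

# 16 characters in four groups; the last character may be the check character "X".
ORCID_PATTERN = re.compile(r"(\d{4})-?(\d{4})-?(\d{4})-?(\d{3}[\dX])$")


def normalise_orcid(value: str) -> Optional[str]:
    """Return the hyphenated iD from a bare iD or an iD URL, else None."""
    match = ORCID_PATTERN.search(value.strip().upper())
    return "-".join(match.groups()) if match else None


def match_users(catalogue_users: list, directory_users: list) -> dict:
    """Pair catalogue users with directory entries: ORCID iD first, e-mail as fallback."""
    by_orcid, by_email = {}, {}
    for entry in directory_users:
        orcid = normalise_orcid(entry.get("orcid", ""))
        if orcid:
            by_orcid[orcid] = entry
        if entry.get("email"):
            by_email[entry["email"].lower()] = entry

    matches = {}
    for user in catalogue_users:
        orcid = normalise_orcid(user.get("orcid", ""))
        found = by_orcid.get(orcid) if orcid else None
        if found is None and user.get("email"):
            found = by_email.get(user["email"].lower())
        matches[user["name"]] = found["id"] if found else None
    return matches


catalogue_users = [
    {"name": "A. Researcher", "orcid": "https://orcid.org/0000-0002-1825-0097"},
    {"name": "B. Scientist", "email": "B.Scientist@example.org"},
    {"name": "C. Unknown"},
]
directory_users = [
    {"id": "dir-1", "orcid": "0000-0002-1825-0097"},
    {"id": "dir-2", "email": "b.scientist@example.org"},
]
```

Users left unmatched (value `None`) would need manual review rather than automatic creation in the external directory.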
Objective and Impact
The catalogue aims to provide functions that cut across RIs for editing and discovering the following items:
- systems for observation and processing (processing has the lowest priority)
- observation events and results (e.g. samples)
- research objects (lowest priority)
Action 1: Persons and documents will be described and federated in pre-existing e-infrastructures, to be defined (e.g. ORCID, …), so as to fulfill requirements for the provenance and curation functions.
Action 2: Dataset descriptions will be federated by harvesting the dataset catalogue of each RI (in whatever 'standard' metadata format it uses) into a single entry point, whose metadata format is to be chosen among DC, DCAT, INSPIRE/ISO19115, GeoNetwork, CKAN and CERIF (to be defined), so as to fulfill requirements for the provenance and curation functions.
Action 3: Editing and discovery functions for observation systems, events and results (including collected samples) will be implemented by a combination of RI-specific tools and federated tools (e.g. for editing), so as to fulfill requirements for the provenance and curation functions.
The main challenge is the involvement of the RIs, from the definition of the functions to the adoption of the solution.
In the context of the 3 above actions:
- define the curation and provenance functions to be provided, and identify the related requirements on the catalogue (format and access API).
- define catalogue requirements for discovery and access
- define metadata profile and access API
- implement the centralized or federated solution
Following the Agile approach, the steps can be iterated, with a new iteration for each newly identified requirement or newly supported RI.
Technical status and requirements
E-infrastructures that manage catalogues of persons and documents already exist, are available through standard interfaces, and cut across RIs.
Catalogues of datasets are generally provided by RIs and their content is available through standard interfaces. Some tools and formats are available off the shelf to implement the catalogue of datasets (DC, DCAT, INSPIRE/ISO19115, GeoNetwork, CKAN, CERIF). ENVRIPLUS needs to federate them by using the richest available 'standard' and providing mappings to the others.
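The idea of keeping records in the richest format and mapping them to the others can be sketched as a set of crosswalk tables. In this minimal example the field names of the rich record are hypothetical, and the mapping is an illustrative assumption rather than an agreed ENVRIPLUS metadata profile. Only the target property names (`dc:title`, `dct:title`, `dcat:keyword`, ...) are taken from the Dublin Core and DCAT vocabularies.

```python
# Crosswalks from a hypothetical rich record to two poorer vocabularies.
CROSSWALKS = {
    "DC": {
        "title": "dc:title",
        "abstract": "dc:description",
        "originator": "dc:creator",
        "identifier": "dc:identifier",
    },
    "DCAT": {
        "title": "dct:title",
        "abstract": "dct:description",
        "originator": "dct:creator",
        "identifier": "dct:identifier",
        "keywords": "dcat:keyword",
    },
}


def to_standard(record: dict, standard: str) -> dict:
    """Project a rich record onto another standard, dropping unmapped fields."""
    mapping = CROSSWALKS[standard]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Made-up sample record, for illustration only.
rich_record = {
    "title": "Sea surface temperature, 2015",
    "abstract": "Daily averages from moored buoys.",
    "originator": "Example RI",
    "identifier": "example-0001",
    "keywords": ["temperature", "ocean"],
    "spatial_extent": [-10.0, 40.0, 5.0, 55.0],
}
```

Calling `to_standard(rich_record, "DC")` keeps four fields and silently drops the keywords and the spatial extent. This loss is why the mapping direction matters: going from the richest format to a poorer one discards information, whereas the reverse direction would leave fields empty.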
Catalogues of observation systems, events or samples may exist in RIs. They are seldom or never accessible through standard interfaces. Some RIs lack proper tools to manage this information, which is nevertheless critical for the quality and traceability of scientific results.
Implementation plan and timetable
Documents and persons
E-infrastructures that manage catalogues of persons and documents already exist.
The implementation case will define a list of official, sustainable person and document repositories which should be used by RIs to describe their resources, and will define mappings to/from the ENVRIPLUS catalogue metadata standard (when chosen).
Expected result in October 2016
Datasets
The implementation case will identify catalogues of datasets in RIs and analyse their machine-to-machine interfaces for harvesting purposes. A single tool will harvest them centrally. Their metadata will then require conversion from the local RI format to that of the ENVRIPLUS central catalogue, as described above.
Expected result in October 2017
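The harvest-then-convert step could be organised as in the following minimal sketch. It assumes that each RI catalogue is reached through some fetch function (here replaced by in-memory stand-ins, since the real machine-to-machine interfaces are still to be analysed) and that one converter per local format produces a common record. The RI names, the converter functions and the common record shape are all hypothetical.

```python
from typing import Iterable, List


def from_dublin_core(raw: dict) -> dict:
    """Convert a Dublin Core style record to the (assumed) common record."""
    return {"identifier": raw["dc:identifier"], "title": raw["dc:title"]}


def from_ckan(raw: dict) -> dict:
    """Convert a CKAN style package dictionary to the (assumed) common record."""
    return {"identifier": raw["id"], "title": raw["title"]}


CONVERTERS = {"DC": from_dublin_core, "CKAN": from_ckan}


def harvest(sources: Iterable[dict]) -> List[dict]:
    """Harvest every source and store converted records in one central index."""
    central = {}
    for source in sources:
        convert = CONVERTERS[source["format"]]
        for raw in source["fetch"]():
            record = convert(raw)
            record["source_ri"] = source["ri"]
            # Keying on (RI, identifier) makes repeated harvests idempotent.
            central[(source["ri"], record["identifier"])] = record
    return list(central.values())


# In-memory stand-ins for the RI catalogues; a real fetch would call the RI interface.
SOURCES = [
    {"ri": "RI-A", "format": "DC", "fetch": lambda: [
        {"dc:identifier": "a-1", "dc:title": "Dataset A1"},
        {"dc:identifier": "a-2", "dc:title": "Dataset A2"},
    ]},
    {"ri": "RI-B", "format": "CKAN", "fetch": lambda: [
        {"id": "b-1", "title": "Dataset B1"},
    ]},
]
```

Supporting an additional RI would then mean registering one fetch function and, if its format is new, one converter, without touching the harvesting loop.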
Observation systems, events or samples
An integrated system will show observation systems, events and collected samples from 2 or 3 RIs in the liquid (EMSO, ARGO), solid (EPOS) and gas (ICOS) spheres.
Tools will be provided to easily edit the descriptions, for RIs that do not yet have their own system.
As before, this will require mapping the metadata describing systems, events and samples at each RI to the common metadata standard of ENVRIPLUS.
Expected result in October 2018
Expected output and evaluation of output
Documents and persons
Number of RIs actually using the chosen person and document e-infrastructure to identify their resources.
Datasets
Number of RIs whose dataset descriptions are available in the federated system.
Number of users of the federated dataset catalogue (inside or outside the RI).
Observation systems, events or samples
Number of observation systems whose events and results are actually available in the federated catalogue.
Number of users relying on the catalogues to support activities in the RIs.
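Two of the indicators above could be computed directly from the federated catalogue content, as in this minimal sketch. It assumes that each catalogue record carries a kind, the RI it comes from and, for events and results, the observation system that produced it. These field names and the sample records are hypothetical.

```python
def evaluation_indicators(records: list) -> dict:
    """Count RIs with datasets and observation systems with events or results."""
    ris_with_datasets = {r["ri"] for r in records if r["kind"] == "dataset"}
    systems_with_output = {
        r["system"] for r in records if r["kind"] in ("event", "result")
    }
    return {
        "ris_with_datasets": len(ris_with_datasets),
        "systems_with_events_or_results": len(systems_with_output),
    }


# Made-up catalogue content, for illustration only.
sample_records = [
    {"kind": "dataset", "ri": "RI-A"},
    {"kind": "dataset", "ri": "RI-A"},
    {"kind": "dataset", "ri": "RI-B"},
    {"kind": "event", "ri": "RI-A", "system": "buoy-01"},
    {"kind": "result", "ri": "RI-A", "system": "buoy-01"},
    {"kind": "result", "ri": "RI-B", "system": "station-07"},
]
```

The user-count indicators cannot be derived this way; they would need access statistics from the catalogue services.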