inbo / ipt-dcat

📃 Data Catalog Vocabulary (DCAT) functionality for the IPT

License: MIT License

dcat specification gbif oscibio

ipt-dcat's Introduction

IPT-DCAT

Rationale

This project aims to make the GBIF Integrated Publishing Toolkit (IPT) compliant with the Data Catalog Vocabulary application profile (DCAT-AP), by exposing Catalog, Dataset, and Distribution information in the IPT. This repository defines the EML to DCAT-AP mapping and describes the functional requirements to implement it in the IPT.

Resources

Fork of the IPT source code: this is where we'll implement the DCAT-AP functionality
DCAT to IPT mapping
DCAT-AP model: image of the DCAT-AP objects, properties and relationships.
DCAT-AP validator: in Swedish, but should work for Belgian DCAT-AP too

ipt-dcat's Issues

rdfs:Resource

After each distribution, this is listed:

<http://.../ipt-dcat/resource?r=glasaal> a rdfs:Resource .

Why is this? Is it required?

Validator suggestions

I just published a dataset and the resulting DCAT is valid 👍. The validator does give some warnings/suggestions though:

I'll list them here. Some of those could be implemented, others might be valid but not be recognized by the validator.

Catalog

Language: could be set to English
Themes: we do have dcat:themeTaxonomy and skos:ConceptScheme, but that is apparently not recognized
License: we have dct:rights, but should probably use dct:license?
Homepage: should we add one?

Dataset

Language: an IPT resource has a metadata language property. Would be useful to use.
Contact point: we have adms:contactPoint. Not sure why it is not recognized.

Content of dcat:themeTaxonomy is split for catalog

The current feed shows:

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

According to the example in the mapping, that last line should be part of dcat:themeTaxonomy instead of shown separately:

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

Or is this how it is supposed to be?

Test for DCAT dataset generation

Test if the dataset metadata is correctly generated

IPT info by @kbraak

I talked to @kbraak from GBIF about mapping EML to DCAT. Here are a couple of resources we could use.

EML is described at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
The GBIF IPT uses a GBIF profile of EML, defined at http://rs.gbif.org/schema/eml-gbif-profile/
The latest version of the EML GBIF profile is 1.1: http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml-gbif-profile.xsd
The IPT uses a custom library to express EML as Java classes: https://github.com/gbif/gbif-metadata-profile. If a new version of the EML GBIF profile is published, this library is manually updated. The IPT only uses one version of the profile (the latest one).
GBIF has already done an EML mapping exercise, which could serve as an example: from EML to the metadata format used by DataCite (which issues DOIs).
They did this by expressing the DataCite metadata as Java classes (in https://github.com/gbif/gbif-doi), using an external plugin/library: see this line.
The IPT uses https://github.com/gbif/gbif-doi as a dependency to get DOI functionality (see pom file)
The actual mapping between EML and the DataCite metadata is done in the IPT code: https://github.com/gbif/ipt/blob/master/src/main/java/org/gbif/ipt/utils/DataCiteMetadataBuilder.java
This mapping is described at https://code.google.com/p/gbif-providertoolkit/wiki/IPT2DataCiteMappings

DCAT dataset dct:description is not the same as Eml#Description

Eml#Description links to the description for the IPT if I'm correct
I suppose the description for a dataset is Eml#Description

Also, do I need to keep the paragraphs?

Add foaf:homepage for Catalog

As described in the mapping file, add foaf:homepage to catalog and populate with IPT#HomepageURL.

Best Distribution dcat:mediaType?

If I understand mediaType, we could use the following for Darwin Core Archives:

zip: understandable, but very general, less informative of what to expect
dwc-a: less widely known, but gives good indication of what to expect

What do you suggest?

Decide on URIs for Publisher, Catalog, Dataset, and Distribution

The current proposal for the URIs is:

Publisher: not yet decided
Catalog: http://data.inbo.be/ipt#Catalog
Dataset: http://data.inbo.be/ipt/resource?r=bird-tracking-gull-occurrences#Dataset
Distribution: http://data.inbo.be/ipt/archive?r=bird-tracking-gull-occurrences

I'd like to let @timrobertson100 from GBIF weigh in on the possibility of having even nicer URIs provided by the IPT, e.g. with content negotiation.

(see also #5)
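
To make the proposal concrete, here is a minimal sketch of how these URIs could be derived from the IPT base URL and a resource shortname. The class `DcatUris` and its method names are hypothetical (they are not part of the IPT code base), and the URI templates simply follow the proposal above, which is still open for discussion.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the proposed Catalog, Dataset and Distribution URIs. */
final class DcatUris {
    private final String baseUrl;

    DcatUris(String baseUrl) {
        // Strip trailing slashes so the templates below never produce "//".
        this.baseUrl = baseUrl.replaceAll("/+$", "");
    }

    String catalog() {
        return baseUrl + "#Catalog";
    }

    String dataset(String shortname) {
        return baseUrl + "/resource?r=" + encode(shortname) + "#Dataset";
    }

    String distribution(String shortname) {
        return baseUrl + "/archive?r=" + encode(shortname);
    }

    // Shortnames end up in a query parameter, so they are URL-encoded.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }
}
```

For example, `new DcatUris("http://data.inbo.be/ipt").dataset("bird-tracking-gull-occurrences")` yields the Dataset URI listed above.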

Test IPT to CKAN mapping

Here's my experience mapping dataset metadata from IPT to CKAN:

EML

IPT	CKAN
Shortname	URL slug name
Title	`Title`
Description	`Description`
Publishing Organisation	Create dataset under a specific organization
Update frequency
Type
Subtype
Metadata Language
Data Language
Data License	`License`
Resource Contacts
Resource Creators	One `Author` and `Author email`
Metadata Providers	One `Maintainer` and `Maintainer email`
Coordinates
Geographic coverage description
Taxonomic coverage description
Taxa
Temporal coverage
n/a keywords
GBIF keywords
CKAN keywords (to add to all datasets)	`Tags`
Associated parties
Project title
Project identifier
Project description
Project funding
Study area description
Design description
Project personnel
Study extent
Sampling description
Quality control
Step descriptions
Resource citation
Resource citation identifier
Bibliography
Collections
Specimen preservation methods
Curational units
Resource homepage
Other data formats
Date created
Data last published
Resource logo URL
Purpose
Maintenance description
Additional information
Alternative identifiers

Other than EML

IPT	CKAN
Visibility	`Visibility`
URL of resource	`Source`
Version	`Version`
URL of DwC-A	`Resource URL`
"Darwin Core Archive"	`Resource name`
Some text	`Resource description`
"DwC-A"	Format

Extra

Usage norms?

Test for DCAT file generation

Create test for:

Dataset data
Catalog data
Distribution data
DCAT file

Test DCAT data is updated when publishing new data

Test if DCAT data is updated

Published but unregistered resources are not listed in DCAT feed

If a dataset is published, but not registered with GBIF, it will appear on the homepage (with organization Not registered), but it is not listed in the DCAT feed. Is this by design?

How are dataset versions handled in DCAT

The IPT supports versions, for both datasets and distributions (both have the same version number and increase it at the same time). Example:

A harvester, like GBIF or the Flemish Open Data platform, generally wants to update its entry of the dataset (e.g. updated title, description and version number) and distribution (replacing old one with new one and increasing version number).

What is the best way to express versions in DCAT?

By listing only the latest version of the dataset, with the latest distribution
By listing only the latest version of the dataset, with all distribution versions
By listing all versions of the dataset, each one with its own distribution

@pietercolpaert, you mentioned:

Separate versions can be different resources which point to a generic dataset

What is then the title and description of the generic dataset? That of the latest version?

Where does the DCAT need to be updated/made

Entry points for the DCAT feed?

Is the last modification the same as lastpublishdate?

A dataset needs a dct:modified, which the mapping describes as the date of last modification in the IPT.
I suppose the last modification is the same as the publication date?

Adding the expires header to the HTTP request

Add an HTTP Expires header when accessing the DCAT feed
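
As an illustration, the header value could be computed from the time the feed was generated plus a caching time. This is a sketch only: `ExpiresHeader` and `valueFor` are hypothetical names, and how long the feed may be cached is an assumption left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper that formats an Expires header value as an HTTP date in GMT. */
final class ExpiresHeader {
    // Fixed-width HTTP date, e.g. "Wed, 29 Jul 2015 10:56:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    /** The feed expires once the caching time has passed since it was generated. */
    static String valueFor(Instant generatedAt, Duration cachingTime) {
        return HTTP_DATE.format(generatedAt.plus(cachingTime));
    }
}
```

With the servlet API, the result could be passed to `response.setHeader("Expires", value)` when serving the feed.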

Theoretical mapping of EML to DCAT

As we plan to use this mapping in the context of the IPT, I would not map the main EML standard, but the EML GBIF profile specifically. The mapping can be (first) described as a document (cf. this document)

Question: what version of DCAT should we map to (Belgium, European, etc.)? It should work for the Flemish Open Data Platform and ideally all CKAN instances worldwide.

Timestamp in DCAT feed is off by an hour

Setup: a test IPT registered with a test organization. No published datasets.

If I generate the DCAT feed at 11:01 GMT+2, the timestamp in the feed is 10:01 GMT+2. Is this an error in the code or is something wrong on the server I'm using?

dct:issued "2015-07-29T10:01+02:00" ;
dct:modified "2015-07-29T10:01+02:00" ;
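
One way to rule out a formatting mismatch is to derive both the clock time and the offset from the same zone. The sketch below assumes the feed timestamp starts from a `java.time.Instant`; `FeedTimestamp` is a hypothetical helper, not the class used in the IPT. It illustrates that 11:01+02:00 and 10:01+01:00 denote the same instant, whereas 10:01+02:00 is one hour earlier. That is what would appear if a UTC+1 clock time were combined with a +02:00 offset, which is a guess at the cause rather than a diagnosis.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that prints a timestamp in the minute precision used by the feed. */
final class FeedTimestamp {
    // "xxx" prints the offset that belongs to the formatted date-time, e.g. +02:00.
    private static final DateTimeFormatter MINUTES_WITH_OFFSET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mmxxx");

    /** Clock time and offset are both taken from the given zone, so they cannot disagree. */
    static String format(Instant instant, ZoneId zone) {
        return MINUTES_WITH_OFFSET.format(instant.atZone(zone));
    }
}
```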

DCAT catalog information

Creation of the DCAT catalog

How to define the spatial data?

There has to be a resource for the spatial data in a dataset. The IPT defines two points with a longitude and latitude.

My thoughts:

dct:spatial [ geo:point [ geo:lat "65" ; geo:long "36" ] ; geo:point [ geo:lat "34" ; geo:long "15" ] ]
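
An alternative would be to express the two points as a single geometry, in the same way the catalog feed puts a GeoJSON string in locn:geometry. The sketch below turns two corner points into a GeoJSON Polygon. `BoundingBoxGeometry` is a hypothetical helper, and treating the two IPT points as opposite corners of a bounding box is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper that turns two bounding-box corners into a GeoJSON Polygon string. */
final class BoundingBoxGeometry {
    /**
     * GeoJSON positions are [longitude, latitude]. The ring is listed
     * counterclockwise and closed by repeating the first position.
     */
    static String toGeoJsonPolygon(double minLon, double minLat, double maxLon, double maxLat) {
        return String.format(Locale.ROOT,
                "{ \"type\": \"Polygon\", \"coordinates\": [ [ "
                        + "[%1$s,%2$s], [%3$s,%2$s], [%3$s,%4$s], [%1$s,%4$s], [%1$s,%2$s]"
                        + " ] ] }",
                minLon, minLat, maxLon, maxLat);
    }
}
```

When the result is written into a Turtle literal, its double quotes still need to be escaped.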

Complete list of DCAT-AP properties

The mapping currently contains the mandatory properties. @pietercolpaert, can you complete this with optional (useful) ones? No rush.

Dataset mandatory according to validator

Setup: a test IPT registered with a test organization. No published datasets.

If I paste the DCAT feed into the validator, I get one error:

The property: dataset is mandatory

So, if there are no published datasets, should we generate a feed at all?

Can dataset dcat:keywords be grouped in sets?

IPT keywords are grouped in thesauri. It would be useful to retain that information in DCAT, so that a harvester like the Flemish Open Data platform only imports keywords from a certain thesaurus (e.g. the one we specify with only Dutch keywords). Is there a way to express this in DCAT?

Error messages on every screen

I'm testing the IPT-DCAT on our development server and I get green error/warning messages on every screen (e.g. while editing metadata):

Is there a certain setting causing this or are those errors valid?

Dataset modified

DCAT Dataset title, description, keywords, contacts, etc. will be those currently populated in the IPT, not those of the last published version of a resource (I still need to verify this). For the Dataset modified date, however, we'll use the last publication date, as it is the more important date and increases with the version number. That means, however, that some elements (title, description, keywords, etc.) might change without the modified date changing.

@timrobertson100, knowing this, would you suggest to map to last modified date instead of last published date?

Test for DCAT distribution generation

Test if the distribution metadata is correctly generated

vcard:Kind instead of vcard:Individual

In adms:contactPoint, the vcard is of type Kind instead of Individual (Individual is a good assumption for 99% of the contact points).

adms:contactPoint [ a vcard:Kind ; vcard:fn "Peter Desmet"  ; vcard:hasEmail <mailto:[email protected]> ] ;

Test DCAT Prefixes generation

Test for the correct generation of the prefixes

What organisation name should be used for the datacatalog?

The DCAT catalog needs to have a publisher.
However, the IPT manager and the organisation that publishes datasets each have a name, and these can differ. (An IPT can also have multiple organisations.)
Which name should we use for the publisher of the catalog?

Write code to generate a DCAT file from EML

Once we have the mapping described (see #1), we can implement it so that when a dataset is (re)published on the IPT, a DCAT file of the metadata is created in addition to the EML metadata. There are two different approaches:

As part of the IPT code base

Advantages:

We can make use of the EML Java classes exposed by the IPT (https://github.com/gbif/gbif-metadata-profile)
We can potentially make use of the same plugin as gbif-doi to express DCAT as GBIF classes (see resources)
We can tap into the publication code to create a DCAT metadata file each time a dataset is (re)published
The functionality becomes part of the IPT (open source) code base and could in collaboration with GBIF be rolled out for all IPTs

Disadvantages:

We're restricted to Java to write the functionality

As a separate script

Advantages:

More choice of software language
No constraints imposed by IPT code

Disadvantages:

None of the advantages of making it part of the IPT code
The script has to rely on the published EML file instead of the EML standard expressed as classes
The script has to be triggered in some way separately from the IPT (e.g. a cron job)
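
Whichever approach is chosen, the core step is turning a few metadata fields into Turtle. The following is a minimal sketch under stated assumptions: `TurtleDataset` is a hypothetical class, plain strings stand in for the EML classes, the prefix declarations are assumed to be written elsewhere, and a real implementation might prefer an RDF library over hand-written serialization.

```java
/** Hypothetical sketch: serialize a few dataset fields as a Turtle dcat:Dataset. */
final class TurtleDataset {
    /** Escapes characters that may not appear unescaped in a quoted Turtle string. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break; // keeps paragraph breaks
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Assumes the dcat: and dct: prefixes are declared at the top of the feed. */
    static String serialize(String datasetUri, String title, String description) {
        return "<" + datasetUri + ">\n"
                + "    a dcat:Dataset ;\n"
                + "    dct:title \"" + escape(title) + "\" ;\n"
                + "    dct:description \"" + escape(description) + "\" .\n";
    }
}
```

Escaping newlines rather than dropping them is one way to keep the paragraphs of a description intact.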

Odd whitespace and punctuation in dcat feed

Setup: a test IPT registered with a test organization. No published datasets.

The dcat feed has a . or ; at the end of each line. Is this valid? And if it is required, wouldn't it be better to remove the whitespace before it? E.g. @prefix schema: <http://schema.org/> . -> @prefix schema: <http://schema.org/>.?

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

Test DCAT validation

Validation if the DCAT syntax is correct

Caching of the information

A way to cache the information of the DCAT feed

The output of the DCAT feed is simply stored as a String
When the DCAT feed is requested, the GenerateDCAT class checks when the String was created. If the current time is later than the creation time plus the caching time, the DCAT feed is regenerated.
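
The caching rule described above can be sketched as follows. `CachedFeed` is a hypothetical stand-in, not the actual GenerateDCAT class; the generator and the time source are passed in as suppliers (an assumption made here so the sketch can be tested without waiting).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical time-based cache for the DCAT feed String. */
final class CachedFeed {
    private final Supplier<String> generator;   // builds a complete feed
    private final Duration cachingTime;         // how long a feed stays fresh
    private final Supplier<Instant> timeSource; // e.g. Instant::now
    private String feed;
    private Instant createdAt;

    CachedFeed(Supplier<String> generator, Duration cachingTime, Supplier<Instant> timeSource) {
        this.generator = generator;
        this.cachingTime = cachingTime;
        this.timeSource = timeSource;
    }

    /** Returns the cached feed, regenerating it once the caching time has passed. */
    synchronized String get() {
        Instant now = timeSource.get();
        if (feed == null || now.isAfter(createdAt.plus(cachingTime))) {
            feed = generator.get(); // replace the stored String, never append to it
            createdAt = now;
        }
        return feed;
    }
}
```

In production code the time source would typically be `Instant::now`.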

What is catalog dct:rights

Is this the license of the catalog list itself or the license of all datasets (which might be different for each dataset)?

Modified timestamp of Catalog is set to current time.

Related to #42. According to the documentation, the modified timestamp of a Catalog should be equal to the Latest Resource#LastPublished. Currently it seems to take the current time and updates on every reload.
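
The documented behaviour amounts to taking the maximum of the last-published dates. A minimal sketch, assuming the dates are available as `java.util.Date` values and that unpublished resources contribute `null`; `CatalogModified` is a hypothetical name.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Date;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical helper: the catalog's modified date is the latest publication date. */
final class CatalogModified {
    /** Returns the most recent non-null date, or empty if nothing has been published. */
    static Optional<Date> latest(Collection<Date> lastPublishedDates) {
        return lastPublishedDates.stream()
                .filter(Objects::nonNull) // resources that were never published
                .max(Comparator.naturalOrder());
    }
}
```

The empty result covers an IPT without published datasets, where some fallback still has to be chosen.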

Example for catalog dcat:themeTaxonomy

Do you have an example for dcat:themeTaxonomy, so I have an idea of how this could be populated?

DCAT feed generated multiple times

Setup: a test IPT registered with a test organization. No published datasets.

The DCAT feed I get is this (note: I have hidden one URL):

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

If I wait a couple of minutes and reload, I get this:

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:49+02:00" ;
dct:modified "2015-07-29T09:49+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

The information is repeated, only the issued/modified timestamps are different. If I wait and reload after that, the content gets repeated again. I don't think this is intentional.

Hash at end of URI might conflict with Angular

@timrobertson100, care to expand?

Theme of a dataset?

The theme of a dataset and the themeTaxonomy of the catalog refer to the same URI: http://eurovoc.europa.eu/5463. But the themeTaxonomy needs a skos:ConceptScheme, while the theme needs a skos:Concept.

Is the given URI for the global catalog, or for each dataset?

Mapping Dataset dcat:landingPage

Still a gap in the documentation: How can we retrieve the resource URL (e.g. http://data.inbo.be/ipt/resource?r=bird-tracking-gull-occurrences) from the code?

Entry point when publishing dataset

Finding the entry point for publishing datasets
Make sure the DCAT file is updated there

What is catalog foaf:homepage?

Is this the homepage of the catalog, or rather publisher or dataset?

Mapping Dataset dct:identifier

We'll add dct:identifier to the dataset, which ideally is populated with the DOI of the dataset and if not available the GBIF registry key.

2 questions for @timrobertson100

Is the IPT aware of the DOI assigned by GBIF or only of DOIs assigned via the IPT?

What format do we choose for the identifier: URL or non-URL?

http://doi.org/10.15468/02omly
doi:10.15468/02omly
http://www.gbif.org/dataset/83e20573-f7dd-4852-9159-21566e1e691e
83e20573-f7dd-4852-9159-21566e1e691e
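
The fallback rule (DOI if available, otherwise the GBIF registry key) could look like the sketch below. It picks the URL form purely for illustration, since the format is still an open question here. `DatasetIdentifier` is a hypothetical helper, and how the IPT exposes the DOI and registry key is not assumed.

```java
import java.util.Optional;
import java.util.UUID;

/** Hypothetical helper: prefer the DOI, fall back to the GBIF registry key. */
final class DatasetIdentifier {
    static Optional<String> asUrl(String doi, UUID gbifKey) {
        if (doi != null && !doi.trim().isEmpty()) {
            // Accept "doi:10.x/y" and resolver URLs, and normalize to one URL form.
            String bare = doi.trim()
                    .replaceFirst("(?i)^(doi:|https?://(dx\\.)?doi\\.org/)", "");
            return Optional.of("http://doi.org/" + bare);
        }
        if (gbifKey != null) {
            return Optional.of("http://www.gbif.org/dataset/" + gbifKey);
        }
        return Optional.empty(); // neither a DOI nor a registry key is known
    }
}
```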

DCAT dataset information for 1 resource

Creating the DCAT information for one dataset

Struts problem on dcat feed

When I want to access the dcat feed on the IPT I am testing, I get this:

Any idea how to prevent this?

What URI should we use for Publisher?

The Publisher of a Catalog/Dataset is a resource, so it needs a URI. What URI should we use? There is no page/URI for this in the IPT (unlike Catalog, Dataset and Distribution), but there is at GBIF (e.g. http://www.gbif.org/publisher/1cd669d0-80ea-11de-a9d0-f1765f95f18b). That URI would not only work for INBO, but for all organizations using the IPT for GBIF. It does land on an HTML representation of the publisher though. Any ideas?

Incorrect publisher for Catalog

GBIF registers organizations (e.g. INBO) and IPT installations (e.g. the INBO IPT).

In the DCAT feed, the catalog has a publisher. This is currently mapped to the IPT installation (IPT DCAT in my test), not the organization (or organizations!) using that installation as a publisher (INBO). I think it should be the latter.

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

Note 1: The URL http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5 won't return anything, because those are test environment UUIDs.

Note 2: the URL of the registered IPT installation (i.e. the catalog) might be useful information. @pietercolpaert, is there another term we can use for this?

Time format in DCAT?

The IPT uses the legacy java.util.Date class to define dates. A DCAT date needs to be formatted according to ISO 8601, for which Java 8 added built-in support (java.time).
Should I keep working with the legacy Date class or can I convert to the Java 8 API?
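
If converting is acceptable, a `java.util.Date` can be bridged to the Java 8 `java.time` API with `Date.toInstant()` and then formatted with a built-in ISO 8601 formatter. A minimal sketch, assuming Java 8 or later is available to the IPT; `IsoDates` is a hypothetical helper name.

```java
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

/** Hypothetical helper: format a legacy java.util.Date as an ISO 8601 date-time. */
final class IsoDates {
    /** The zone decides which local time and offset are printed for the instant. */
    static String toIso8601(Date date, ZoneId zone) {
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(date.toInstant().atZone(zone));
    }
}
```

Passing `ZoneId.systemDefault()` would print the server's local time with its offset.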
