ipt-dcat's Introduction

IPT-DCAT

Rationale

This project aims to make the GBIF Integrated Publishing Toolkit (IPT) compliant with the Data Catalog Vocabulary application profile (DCAT-AP), by exposing Catalog, Dataset, and Distribution information in the IPT. This repository defines the EML to DCAT-AP mapping and describes the functional requirements to implement it in the IPT.

Resources


ipt-dcat's Issues

rdfs:Resource

After each distribution, this is listed:

<http://.../ipt-dcat/resource?r=glasaal> a rdfs:Resource .

Why is this? Is it required?

Validator suggestions

I just published a dataset and the resulting DCAT is valid 👍. The validator does give some warnings/suggestions though.

I'll list them here. Some of those could be implemented, others might be valid but not be recognized by the validator.

Catalog

  1. Language: could be set to English
  2. Themes: we do have dcat:themeTaxonomy and skos:ConceptScheme, but that is apparently not recognized
  3. License: we have dct:rights, but should probably use dct:license?
  4. Homepage: should we add one?

Dataset

  1. Language: an IPT resource has a metadata language property. Would be useful to use.
  2. Contact point: we have adms:contactPoint. Not sure why it is not recognized.

Content of dcat:themeTaxonomy is split for catalog

The current feed shows:

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

According to the example in the mapping, that last line should be part of dcat:themeTaxonomy instead of shown separately:

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

Or is this how it is supposed to be?

IPT info by @kbraak

I talked to @kbraak from GBIF regarding mapping EML to DCAT, and here are a couple of resources we could use.

  1. EML is described at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
  2. The GBIF IPT uses a GBIF profile of EML, defined at http://rs.gbif.org/schema/eml-gbif-profile/
  3. The latest version of the EML GBIF profile is 1.1: http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml-gbif-profile.xsd
  4. The IPT uses a custom library to express EML as Java classes: https://github.com/gbif/gbif-metadata-profile. If a new version of the EML GBIF profile is published, this library is manually updated. The IPT only uses one version of the profile (the latest one).
  5. GBIF has already done an EML mapping exercise, which could serve as an example: from EML to the metadata format used by DataCite (which issues DOIs).
  6. They did this by expressing the DataCite metadata as Java classes (in https://github.com/gbif/gbif-doi), using an external plugin/library: see this line.
  7. The IPT uses https://github.com/gbif/gbif-doi as a dependency to get DOI functionality (see pom file)
  8. The actual mapping between EML and the DataCite metadata is done in the IPT code: https://github.com/gbif/ipt/blob/master/src/main/java/org/gbif/ipt/utils/DataCiteMetadataBuilder.java
  9. This mapping is described at https://code.google.com/p/gbif-providertoolkit/wiki/IPT2DataCiteMappings

Best Distribution dcat:mediaType?

If I understand mediaType, we could use the following for Darwin Core Archives:

  • zip: understandable, but very general, less informative of what to expect
  • dwc-a: less widely known, but gives good indication of what to expect

What do you suggest?

Test IPT to CKAN mapping

Here's my experience mapping dataset metadata from IPT to CKAN:

EML

IPT | CKAN
Shortname | URL slug name
Title | Title
Description | Description
Publishing Organisation | Create dataset under a specific organization
Update frequency |
Type |
Subtype |
Metadata Language |
Data Language |
Data License | License
Resource Contacts |
Resource Creators | One Author and Author email
Metadata Providers | One Maintainer and Maintainer email
Coordinates |
Geographic coverage description |
Taxonomic coverage description |
Taxa |
Temporal coverage |
n/a keywords |
GBIF keywords |
CKAN keywords (to add to all datasets) | Tags
Associated parties |
Project title |
Project identifier |
Project description |
Project funding |
Study area description |
Design description |
Project personnel |
Study extent |
Sampling description |
Quality control |
Step descriptions |
Resource citation |
Resource citation identifier |
Bibliography |
Collections |
Specimen preservation methods |
Curational units |
Resource homepage |
Other data formats |
Date created |
Data last published |
Resource logo URL |
Purpose |
Maintenance description |
Additional information |
Alternative identifiers |

Other than EML

IPT | CKAN
Visibility | Visibility
URL of resource | Source
Version | Version
URL of DwC-A | Resource URL
"Darwin Core Archive" | Resource name
Some text | Resource description
"DwC-A" | Format

Extra

Usage norms?

How are dataset versions handled in DCAT

The IPT supports versions, for both datasets and distributions (both have the same version number and increase it at the same time). Example:

A harvester, like GBIF or the Flemish Open Data platform, generally wants to update its entry of the dataset (e.g. updated title, description and version number) and distribution (replacing old one with new one and increasing version number).

What is the best way to express versions in DCAT?

  • By listing only the latest version of the dataset, with the latest distribution
  • By listing only the latest version of the dataset, with all distribution versions
  • By listing all versions of the dataset, each one with its own distribution

@pietercolpaert, you mentioned:

Separate versions can be different resources which point to a generic dataset

What is then the title and description of the generic dataset? That of the latest version?

Theoretical mapping of EML to DCAT

As we plan to use this mapping in the context of the IPT, I would not map the main EML standard, but the EML GBIF profile specifically. The mapping can be (first) described as a document (cf. this document)

Question: what version of DCAT should we map to (Belgium, European, etc.)? It should work for the Flemish Open Data Platform and ideally all CKAN instances worldwide.

Timestamp in DCAT feed is off by an hour

Setup: a test IPT registered with a test organization. No published datasets.

If I generate the DCAT feed at 11:01 GMT+2, the timestamp in the feed is 10:01 GMT+2. Is this an error in the code or is something wrong on the server I'm using?

dct:issued "2015-07-29T10:01+02:00" ;
dct:modified "2015-07-29T10:01+02:00" ;
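
One way to narrow this down is to format the same instant in different zones and compare the result with the feed. The sketch below is only a diagnostic aid, not the IPT-DCAT code: the class name FeedTimestampCheck, the helper format and the zone Europe/Brussels are assumptions made for illustration. If the feed shows the wall-clock time of a UTC+1 zone combined with a +02:00 suffix, the offset is probably being written independently of the zone used for the time.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class FeedTimestampCheck {

    // Same shape as the feed timestamps, e.g. 2015-07-29T11:01+02:00.
    private static final DateTimeFormatter FEED_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mmxxx");

    // Renders an instant in the given zone; the offset is derived from the zone,
    // so the wall-clock time and the offset can never disagree.
    static String format(Instant instant, ZoneId zone) {
        return OffsetDateTime.ofInstant(instant, zone).format(FEED_FORMAT);
    }

    public static void main(String[] args) {
        // 09:01 UTC is 11:01 at GMT+2, the moment described in this issue.
        Instant generated = Instant.parse("2015-07-29T09:01:00Z");

        // prints 2015-07-29T11:01+02:00
        System.out.println(format(generated, ZoneId.of("Europe/Brussels")));
        // prints 2015-07-29T10:01+01:00
        System.out.println(format(generated, ZoneOffset.ofHours(1)));
        // Shows which zone the server's JVM would use by default.
        System.out.println("JVM default zone: " + ZoneId.systemDefault());
    }
}
```

Running this on the server shows the JVM default zone, which helps tell a code problem apart from a server configuration problem.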

How to define the spatial data?

There has to be a resource for the spatial data in a dataset. The IPT defines two points with a longitude and latitude.

My thoughts:

dct:spatial [ geo:point [ geo:lat "65" ; geo:long "36" ] ; geo:point [ geo:lat "34" ; geo:long "15" ] ]
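
As an alternative to consider, the two corner points could be written as a single GeoJSON polygon inside locn:geometry, which is the pattern the catalog feed quoted in other issues here uses for a point. The sketch below assumes the two IPT points are opposite corners of a bounding box; the class SpatialCoverage and the method toTurtle are hypothetical names, and whether a polygon is the right geometry is still an open question.

```java
import java.util.Locale;

final class SpatialCoverage {

    // Builds a dct:spatial Turtle fragment holding a GeoJSON polygon for a bounding box.
    // Coordinates follow GeoJSON order: longitude first, then latitude.
    static String toTurtle(double minLon, double minLat, double maxLon, double maxLat) {
        // Closed ring: the first and last positions are identical.
        String ring = String.format(Locale.ROOT,
                "[ [%1$s,%2$s], [%3$s,%2$s], [%3$s,%4$s], [%1$s,%4$s], [%1$s,%2$s] ]",
                minLon, minLat, maxLon, maxLat);
        String geoJson = "{ \"type\": \"Polygon\", \"coordinates\": [ " + ring + " ] }";
        // Escape the double quotes so the JSON fits inside a Turtle string literal.
        String literal = geoJson.replace("\"", "\\\"");
        return "dct:spatial [ a dct:Location ; locn:geometry \"" + literal + "\" ] ;";
    }

    public static void main(String[] args) {
        // The two points from this issue: (lat 65, long 36) and (lat 34, long 15).
        System.out.println(toTurtle(15, 34, 36, 65));
    }
}
```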

Dataset mandatory according to validator

Setup: a test IPT registered with a test organization. No published datasets.

If I paste the DCAT feed into the validator, I get one error:

The property: dataset is mandatory

So, if there are no published datasets, should we generate a feed at all?

Can dataset dcat:keywords be grouped in sets?

IPT keywords are grouped in thesauri. It would be useful to retain that information in DCAT, so that a harvester like the Flemish Open Data platform only imports keywords from a certain thesaurus (e.g. the one we specify with only Dutch keywords). Is there a way to express this in DCAT?

Error messages on every screen

I'm testing the IPT-DCAT on our development server and I get green error/warning messages on every screen (e.g. while editing metadata).

Is there a certain setting causing this or are those errors valid?

Dataset modified

The DCAT Dataset title, description, keywords, contacts, etc. will be those currently populated in the IPT, not those of the last published version of a resource (I still need to verify this). For the Dataset modified date, however, we'll use the last publication date, as it is the more important date and advances together with the version number. This does mean that some elements (title, description, keywords, etc.) might change without the modified date changing.

@timrobertson100, knowing this, would you suggest to map to last modified date instead of last published date?

vcard:Kind instead of vcard:Individual

In adms:contactPoint, the vcard is of type Kind instead of Individual (Individual is a good assumption for 99% of the contact points).

adms:contactPoint [ a vcard:Kind ; vcard:fn "Peter Desmet"  ; vcard:hasEmail <mailto:[email protected]> ] ;

What organisation name should be used for the datacatalog?

The DCAT catalog needs to have a publisher.
Both the IPT manager and the organisation that publishes datasets have a name, and the two can differ (an IPT can also host multiple organisations).
Which name should we use for the publisher of the catalog?

Write code to generate a DCAT file from EML

Once we have the mapping described (see #1), we can implement it so that when a dataset is (re)published on the IPT, a DCAT file of the metadata is created in addition to the EML metadata. There are two different approaches:

As part of the IPT code base

Advantages:

  • We can make use of the EML Java classes exposed by the IPT (https://github.com/gbif/gbif-metadata-profile)
  • We can potentially make use of the same plugin as gbif-doi to express DCAT as GBIF classes (see resources)
  • We can tap into the publication code to create a DCAT metadata file each time a dataset is (re)published
  • The functionality becomes part of the IPT (open source) code base and could in collaboration with GBIF be rolled out for all IPTs

Disadvantages:

  • We're restricted to Java to write the functionality

As a separate script

Advantages:

  • More choice of software language
  • No constraints imposed by IPT code

Disadvantages:

  • None of the advantages of making it part of the IPT code
  • The script has to rely on the published EML file instead of the EML standard expressed as classes
  • The script has to be triggered in some way separately from the IPT (e.g. a cron job)

Odd whitespace and punctuation in dcat feed

Setup: a test IPT registered with a test organization. No published datasets.

The dcat feed has a . or ; at the end of each line. Is this valid? And if required, wouldn't it be better to remove the whitespace before it? E.g. @prefix schema: <http://schema.org/> . -> @prefix schema: <http://schema.org/>.?

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

Caching of the information

A way to cache the information of the DCAT feed:

The output of the DCAT feed is simply stored as a String.
When the DCAT feed is requested, the GenerateDCAT class looks at the time the String was created. If the current time is later than the creation time plus the caching time, the DCAT feed is regenerated.
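
A minimal sketch of that idea, under stated assumptions: CachedFeed is a hypothetical class (not the actual GenerateDCAT code), the feed generator is passed in as a Supplier, and the clock is injectable so the expiry rule can be tested. The cached String is replaced on regeneration rather than appended to, so stale content cannot accumulate.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CachedFeed {
    private final Supplier<String> generator; // builds the complete feed from scratch
    private final long cachingTimeMillis;     // how long a generated feed stays valid
    private final LongSupplier clock;         // current time in milliseconds

    private String feed;                      // last generated feed, null until first request
    private long createdAtMillis;             // time at which 'feed' was generated

    CachedFeed(Supplier<String> generator, long cachingTimeMillis, LongSupplier clock) {
        this.generator = generator;
        this.cachingTimeMillis = cachingTimeMillis;
        this.clock = clock;
    }

    // Returns the cached feed, regenerating it when the current time is later
    // than the creation time plus the caching time.
    synchronized String get() {
        long now = clock.getAsLong();
        if (feed == null || now > createdAtMillis + cachingTimeMillis) {
            feed = generator.get(); // replace the old feed, do not append to it
            createdAtMillis = now;
        }
        return feed;
    }
}
```

In production the clock argument would typically be System::currentTimeMillis, and the generator whatever method builds the feed String.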

What is catalog dct:rights

Is this the license of the catalog list itself or the license of all datasets (which might be different for each dataset)?

DCAT feed generated multiple times

Setup: a test IPT registered with a test organization. No published datasets.

The DCAT feed I get is this (note: I have hidden one URL):

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

If I wait a couple of minutes and reload, I get this:

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:19+02:00" ;
dct:modified "2015-07-29T09:19+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix schema: <http://schema.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix locn: <http://www.w3.org/ns/locn#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T09:49+02:00" ;
dct:modified "2015-07-29T09:49+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .

The information is repeated, only the issued/modified timestamps are different. If I wait and reload after that, the content gets repeated again. I don't think this is intentional.

Theme of a dataset?

The theme of a dataset and the themeTaxonomy of the catalog refer to the same URI: http://eurovoc.europa.eu/5463. But the themeTaxonomy needs a skos:ConceptScheme, while the theme needs a skos:Concept.

Is the given URI for the global catalog, or for each dataset?

Mapping Dataset dct:identifier

We'll add dct:identifier to the dataset, which ideally is populated with the DOI of the dataset and if not available the GBIF registry key.

2 questions for @timrobertson100

  1. Is the IPT aware of the DOI assigned by GBIF or only of DOIs assigned via the IPT?

  2. What format do we choose for the identifier: URL or non-URL?

    http://doi.org/10.15468/02omly
    doi:10.15468/02omly
    http://www.gbif.org/dataset/83e20573-f7dd-4852-9159-21566e1e691e
    83e20573-f7dd-4852-9159-21566e1e691e
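
To make the fallback concrete, here is a sketch of "DOI if available, otherwise the GBIF registry key". It assumes the URL form is chosen for both, which is still the open question above; the class DatasetIdentifier and the method resolve are hypothetical names, not existing IPT code.

```java
import java.util.UUID;

final class DatasetIdentifier {

    // Returns the DOI as a URL when one is present, otherwise the GBIF registry dataset URL.
    static String resolve(String doi, UUID registryKey) {
        if (doi != null && !doi.trim().isEmpty()) {
            // Accept "doi:10.x/y" as well as "http(s)://doi.org/10.x/y" and normalise to one form.
            String bare = doi.trim()
                    .replaceFirst("(?i)^(doi:|https?://(dx\\.)?doi\\.org/)", "");
            return "http://doi.org/" + bare;
        }
        if (registryKey == null) {
            throw new IllegalArgumentException("Neither a DOI nor a registry key is available");
        }
        return "http://www.gbif.org/dataset/" + registryKey;
    }
}
```

With the values from this issue, resolve("doi:10.15468/02omly", null) gives http://doi.org/10.15468/02omly, and a missing DOI with key 83e20573-f7dd-4852-9159-21566e1e691e gives http://www.gbif.org/dataset/83e20573-f7dd-4852-9159-21566e1e691e.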
    

Incorrect publisher for Catalog

GBIF registers organizations (e.g. INBO) and IPT installations (e.g. the INBO IPT).

In the DCAT feed, the catalog has a publisher. This is currently mapped to the IPT installation (IPT DCAT in my test), not the organization (or organizations!) using that installation as a publisher (INBO). I think it should be the latter.

<http://.../ipt-dcat/#catalog>
 a dcat:Catalog ;
dct:title "IPT DCAT" ;
dct:description "IPT DCAT" ;
dct:publisher <http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> ;
dct:issued "2015-07-29T12:26+02:00" ;
dct:modified "2015-07-29T12:26+02:00" ;
dcat:themeTaxonomy <http://eurovoc.europa.eu/218403> ;
dct:rights <https://creativecommons.org/publicdomain/zero/1.0/> ;
dct:spatial [ a dct:Location ; locn:geometry "{ \"type\": \"Point\", \"coordinates\": [ 3.49,51.2 ] }" ] .

<http://eurovoc.europa.eu/218403> a skos:ConceptScheme ; dct:title "biodiversity"@en .

<http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5/#Organization> a foaf:Agent ; foaf:name "IPT DCAT" .


Note 1: The URL http://www.gbif.org/publisher/b2069a2e-0fb3-4193-9fee-1910694cfca5 won't return anything, because those are test environment UUIDs.

Note 2: the URL of the registered IPT installation (i.e. the catalog) might be useful information. @pietercolpaert, is there another term we can use for this?

Time format in DCAT?

The java.util.Date class used in the IPT to define a date is largely deprecated (most of its constructors and methods). A DCAT date needs to be formatted according to the ISO 8601 standard, which has built-in support in the java.time API since Java 8.
Should I keep working with the legacy Date class or can I convert to the Java 8 API?
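
If Java 8 is acceptable, the two options do not exclude each other: a java.util.Date can be converted at the point where the feed is written. A small sketch, assuming the IPT keeps handing out Date objects; the class DcatDates and the method toIso8601 are illustrative names only.

```java
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

final class DcatDates {

    // Converts a legacy java.util.Date to an ISO 8601 date-time with offset,
    // expressed in the given zone.
    static String toIso8601(Date date, ZoneId zone) {
        return OffsetDateTime.ofInstant(date.toInstant(), zone)
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

Date.toInstant() was added in Java 8, so the existing Date values can stay as they are and only the formatting step changes.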
