idn-au / catalogue-data

License: Creative Commons Attribution 4.0 International

Python 100.00%

catalogue-data's Introduction

IDN Catalogue Data

This repository contains part of the data in the Indigenous Data Network (IDN) Knowledge Graph, which is delivered online via the Prez system as a series of catalogues and reference datasets, such as spatial data collections and vocabularies.

The IDN Prez system is online at:

https://data.idnau.org

IDN Catalogues and Datasets

The IDN is producing multiple systems and datasets:

Demonstration Catalogue of Australian datasets
- with varying levels of Indigenous relevance, to demonstrate several aspects of Indigenous data governance and sovereignty, and how to rate the "Indigenous-ness" of data in the first place.
Agents Database
- containing information about Agents - People and Organisations - that have some relation to Indigenous data
University of Melbourne’s Indigenous Data Catalogue
- this is currently (May, 2023) empty but will fill shortly
Register of vocabularies
- multiple vocabularies, all assembled, and some created, by the IDN, that support modelling Indigenous data
Indigenous spatial reference data
- Indigenous language, land use, treaty and other areas
- all from other sources, attributed in the data

Additionally, the IDN will support a catalogue of ANU's Indigenous data that is under development by ANU’s First Nations Portfolio and is not online yet.

This repository contains only some of those systems’ data; see the next section.

This repository’s content

This repository contains:

Demonstration Catalogue items' metadata
- metadata entries for the catalogued resources in data/democat/
the vocabularies within the IDN’s Register of vocabularies
- within data/vocabularies/
background ontologies used to provide labelling for Prez' data
- within data/_background/
IDN Prez system metadata
- within data/system/
- defines things like the multiple IDN catalogues, system labels etc.

Also:

data/unpublished/ contains data that was previously published and then removed, but not deleted, as it may be used again

Stored elsewhere are:

Agents Database content
- some test data is stored here, but the Agents DB is building/storing its own data
- see the AgentsDB data repository
Indigenous spatial reference data
- some of these datasets are large so their raw content isn’t directly available
- see the repo https://github.com/idn-au/spatial-data for a listing of the datasets and instructions on how they are produced

(Meta)Data Models

The metadata of items in the Demonstration Catalogue and all other catalogues based on IDN work - the UoM IDCat and the ANU’s FNP’s future catalogue - use the IDN Catalogue Profile which is a data cataloguing standard based on DCAT.

Agents data in the Agents Database are formulated according to the Agents Governance Profile.

License & Rights

The contents of this repository are licensed under Creative Commons Attribution 4.0 International. See the LICENSE file in the repository for details.

Contact

For technical enquiries:

Jamie Feiss
Data Infrastructure Developer
Indigenous Data Network
University of Melbourne
[email protected]

For policy:

Levi Murray
Strategic Data Manager
Indigenous Data Network
University of Melbourne
[email protected]

Owner Organisation
Indigenous Data Network
https://idnau.org

catalogue-data's Issues

Create Dummy data for RIMPA presentation

To demonstrate the use case:

Cassey is a Wurundjeri woman tasked by her community with searching the collections at the Australian National University
for data about them and ensuring that the data is held appropriately.
As a representative of a community that is a stakeholder in some of the data in the system, Cassey wants to be able to
discover data about/from her community, regardless of its particular home location and then to be able to inspect the access
policies applied to it. She then wants to be able to verify that the policies, as stated, are implemented.

Create the data
Record a demonstration of the use case scenario with the dummy data.

convert aries data < 2015

Data received from Adegboyega 26/04/2024.

To be cleaned and converted to RDF in line with the previously converted aries dump of data from 2015 onwards

Update IDN AGENTS DB registration procedures

Update IDN Agents DB registration workflow in line with https://agentsdb.idnau.org/: https://docs.google.com/drawings/d/1SZNf2S6lI2QmLif5U84TFUw498VawgVFXIpN7D5-oL0/edit?usp=sharing
Write up governance process with Roles & Responsibilities

Can't access RDF metadata

No access is given to the RDF forms of a resource's metadata within the IDN Demo Cat.

For example, for the Distribution of the Aboriginal Tribes of Australia (1940) resource, I can't add conventional query strings to the PID IRI to get RDF:

https://data.idnau.org/pid/DATA40?_mediatype=text/turtle -- just gives HTML

and there is no access to the Alternate Profiles anywhere linked to on the page or by QSA convention:

https://data.idnau.org/pid/DATA40?_profile=alt - 404
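
A minimal sketch of the two request styles this issue expects to work: the `_mediatype` and `_profile` query-string arguments quoted above, and ordinary HTTP `Accept` content negotiation. The helper names `with_qsa` and `rdf_request` are hypothetical, and nothing here confirms how the IDN Prez instance actually responds; the request is only built, not sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request


def with_qsa(iri: str, **params: str) -> str:
    """Return iri with the given query-string arguments added or replaced."""
    parts = urlsplit(iri)
    query = dict(parse_qsl(parts.query))
    query.update(params)
    # Keep '/', '+' and ':' readable, e.g. _mediatype=text/turtle
    return urlunsplit(parts._replace(query=urlencode(query, safe="/+:")))


def rdf_request(iri: str, mediatype: str = "text/turtle") -> Request:
    """Build (but do not send) a request asking for RDF via the Accept header."""
    return Request(iri, headers={"Accept": mediatype})


pid = "https://data.idnau.org/pid/DATA40"
turtle_url = with_qsa(pid, _mediatype="text/turtle")
alt_url = with_qsa(pid, _profile="alt")
req = rdf_request(pid)
```

Passing `req` to `urllib.request.urlopen` and inspecting the response's `Content-Type` header would show whether the server honours either style.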

Validation in the IDN vocabs is out of step with vocExcel 6.*

I just put the licence vocab in and it failed validation. I replaced the skos:historyNote with dcterms:provenance and it validated, so I guess the validator in the IDN vocabs is the older version and will need to be updated at some point.

Harvest data from ANU OAI-PMH

https://openresearch-repository.anu.edu.au/oai/request?verb=ListSets

Start with the ANU Thesis sets and then anything else with ANU in the title.

It is a Dspace server.
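
A rough sketch of how a paged OAI-PMH harvest could be driven, assuming the standard `ListRecords` verb, the `oai_dc` metadata prefix and `resumptionToken` paging. The set name `EXAMPLE_SET` and the sample XML are made up for illustration; real set names come from the `ListSets` response linked above. No network call is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "https://openresearch-repository.anu.edu.au/oai/request"
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(set_spec=None, token=None, prefix="oai_dc"):
    """Build a ListRecords URL; a resumptionToken must be sent on its own."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{BASE}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record header identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter(f"{OAI}identifier")]
    token_el = root.find(f".//{OAI}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Made-up response fragment, only to exercise parse_page.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(sample)
first_url = list_records_url("EXAMPLE_SET")
next_url = list_records_url(token=token)
```

A harvest loop would fetch `first_url`, call `parse_page` on each response, and keep requesting `list_records_url(token=...)` until the token comes back as `None`.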

push updated prez image

release v3.8.11 contains fixes from recent PRs.

Test locally with PrezUI v3.8.2
Push to Dev AWS / Terraform
Push to Prod MRC

Add slides for Rimpa presentation

https://docs.google.com/presentation/d/1COHPQRlVBX4OoLL-3e0z9QYvAbIxW4QfAZVjuISrm6I/edit?pli=1#slide=id.g2dc0d7322cf_2_6

Aim for ~ 15 mins of content.

Talk over ETL and then searching using sparql / the portal

expecting about 6 diagrams / screenshots and speaker notes.

Add Metadata for ARIES/Thesis/OAIPMH data.

Need to define datasets and vocabularies for all the data that I have extracted from ANU.

Clean it up and add all necessary metadata for proper rendering in PrezUI, as we will soon need to upload it all to PrezUI hosted at NCI.

ensure publications are spatially linked where possible

Just as a start use the datasets we have that are being matched under the mentionsIndigenousLanguage flag.

and make sure that you can use a map to search spatially for publications.

update methodologies.md for the aries parser repository

include details about
what data prep has been done on each source dataset.
characters stripped out etc.
columns concatenated, or split...

Updating the indigenous-persons-organisation definition

Hi Nick - I went to create a pull-request but then realised I didn't know how to in this repo, so I will submit the extended definition as an issue instead. I have updated the definition to include the Office of the Registrar of Indigenous Corporations (ORIC) indigeneity requirement. I have also added a historyNote to show where the definition comes from and the hyperlink to the policy document. I am not sure if the code is correct but I thought I would have a go. - Margie

:indigenous-persons-organisation
a skos:Concept ;
dcterms:provenance "Created for the IDN project, 2022"@en ;
rdfs:isDefinedBy cs: ;
skos:definition "The organisation comprises indigenous persons that meet the Office of the Registrar of Indigenous Corporations (ORIC) indigeneity requirement." ;
skos:historyNote "Office of the Registrar of Indigenous Corporations, Policy Statement 11." ;
rdfs:seeAlso <https://www.oric.gov.au/sites/default/files/documents/01_2022/PS-11_Indigeneity-requirement_v7-0.pdf> ;
skos:inScheme cs: ;
skos:prefLabel "Indigenous Persons Organisation"@en .

Migration of scripts from K-AI to ANU

30 Apr - Lawson to ask Jamie if the migration of the ANU prez dev to NCI is something he needs to get info about

26-Mar Boyega to bring Jamie up to speed of any work required

migrate from kurrawong graphdb to anu nci fuseki

rerun the scripts with new data from boyega. (pure extract)
ensure metadata is presentable in prez (appropriately catalogued)
get access to fuseki.
migrate to nci.

Briscoe-Smith Metadata

Communicate with Len and Sandra to arrive at proper metadata for the briscoe-smith archive resource.

i.e. the resource that represents the archive itself.

Use the metadata entry tool to create the RDF and gather the required details.

There may be some conflicts with the information presented in the METADATA table in the HDMS database, to be clarified with Len and Sandra.

Start ANU catalogue documentation

add management notes for FUSEKI to K-AI documentation
add relevant FUSEKI management info to ANU docs

Pointing to data and/or to landing/provenance description pages -- AUSLANG example -- for consideration

Thinking out loud.... purpose -- to be very clear and distinct about what we are showing in the catalogue.

As at 2022-08-30 1.26pm: exploring http://idn.kurrawong.net/catalog/idndc/AUSLANG

Followed the "Access address" link which resolves to a download of the actual dataset.

Checking under the hood (curl -H -v) the target url is https://collection.aiatsis.gov.au/datasets/austlang/001.csv

The "ex:home" page is https://collection.aiatsis.gov.au/datasets/austlang/001 which is the provider's landing page, providing a data dictionary, a link to a live online search service and a link to the download csv format above.

There is also this: https://collection.aiatsis.gov.au/austlang/about which the data.gov.au LANDING page falsely describes as the "complete Austlang resource" -- it is not, but it is AIATSIS's complete DESCRIPTION of the context, meaning and provenance of the dataset! In fact, one could make a strong argument that unless this is read, one really doesn't understand at all what one is looking at.

There is also data.gov.au's activity list, showing recent changes: https://data.gov.au/data/dataset/activity/austlang-dataset-001

Starting to think we might need a small vocab to encode a range of relationships between the thing our catalog entry is describing and an associated resolvable uri!

e.g.

DESCRIPTIVE relationship: Contextual information (descriptive metadata, provenance) ABOUT the cataloged "dataset":

data.gov.au landing page
data.gov.au activity page (where relevant - calculable if you use this form of the ID: https://data.gov.au/data/dataset/austlang-dataset-001)
data providers' own general landing page
data providers detailed provenance "about" page:

DERIVATIVE or INSTANCE relationship: An accessible distribution of the catalogued "dataset":

a computer file with a particular format (in this case csv)
a service end point

I can even imagine a "USAGE" or "APPLICATION" type of relationship -- could point at resources in which dataset featured, or educational/capability resources in how to use it.

Open to discussion but in this PARTICULAR case (a really core "reference" dataset), I think the catalogue presentation would benefit by having multiple "Access address" entries with a clear typing of the nature of the association. Or perhaps even a way for the user to choose their "focus" ("I know what I want... just point me at the data/service" versus "what the hell is this all about, how can I use it?").

Sorry to ramble but devil in the detail here!

Plus issues with scalability... we perhaps need to be thinking about patterns in what "portals" like Trove or data.gov.au or RDA are doing? But that's another story.

Briscoe-Smith DataModel

Create a first draft data model (lucid chart diagram)

Use the RiC-O Ontology to model the data and ensure encapsulation of all primary data fields from:

INVENTORY
SERIES
ACCESSION
PROVENANCE

VocPub profile formats listing missing text/anot+turtle

I believe this is just because it is the current profile view.

RDF for VocPub Profile for list of vocabs does not give labels

Potentially an issue with the query generated by the generate_listing_construct() function in object_listings.py

convert languages to objects

currently they are just comma delimited string literals. Terhi would like to see them as objects so that analysis is easier.
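
One way the split could look, as a sketch only: each comma-delimited name becomes its own IRI so that it can be treated as an object rather than a string. The base IRI `https://example.org/language/` and the slug scheme are placeholders, not the namespace the project uses, and matching against an existing language vocabulary is not attempted.

```python
import re

# Hypothetical base IRI: the real namespace for language objects is not given.
LANG_BASE = "https://example.org/language/"


def slugify(label: str) -> str:
    """Lower-case a label and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def languages_to_iris(literal: str) -> list[str]:
    """Split a comma-delimited literal into one IRI per distinct language."""
    seen, iris = set(), []
    for name in literal.split(","):
        slug = slugify(name)
        if slug and slug not in seen:
            seen.add(slug)
            iris.append(LANG_BASE + slug)
    return iris


iris = languages_to_iris("Wiradjuri, Pitjantjatjara,, wiradjuri")
```

Note that this ASCII-only slug drops diacritics and other non-ASCII letters, so a real conversion would need a more careful normalisation step.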

Digitize maps from Len

16-Apr Have talked with Liam and he is half way through
09-Apr Need to follow up with Liam to see how he is going with the work

05-Mar Check in with Liam to see if this has started

20-Feb Liam is currently handling this work

Work with Liam to digitize the maps from Len. Once we have them as shape files, we can convert to RDF

Review IDN Prez config

System:

Vocabs:

Deduplicate 2 x License vocabs - PR 12
#24
- fix labels in https://github.com/idn-au/vocab-data/blob/main/vocabs/licenses.ttl by putting the acronyms in skos:notation rather than skos:prefLabel and using the long name in the skos:definition for a new skos:prefLabel value
#35

Spatial:

WKT data for NNTT dataset
Feature labels for NSW data
Search returns no results

Label fix steps:

add descriptions to label versions of background ontologies (https://github.com/Kurrawong/semantic-background/blob/main/scripts/extractor.py)
reproduce *-labels.ttl for all ref onts (in https://github.com/Kurrawong/semantic-background/tree/main/labels)
replace full ontology n-quads with just labelled versions from above in https://github.com/RDFLib/prez/tree/main/prez/reference_data/context_ontologies

Create a governance testing KG

Create a small RDF KG that lists dummy datasets, people, organisations and policies and allows demo querying of it to discover good and otherwise governance arrangements.

RDF text/anot+turtle for Members does not apply .ttl extension to file returned

Issue Identified.

When requesting an annotated mediatype, the response from Prez (API) included the 'anot+' prefix in the Content-Type header.

see: http://localhost:8000/v/vocab?_profile=prfl:mem&_mediatype=text/anot+turtle
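
To illustrate the mismatch, here is a sketch of how a client could normalise an annotated mediatype before choosing a download extension. This is not Prez's actual fix; the `EXTENSIONS` table and both function names are assumptions made for the example.

```python
# Illustrative mapping only; a real client may support more formats.
EXTENSIONS = {
    "text/turtle": ".ttl",
    "application/ld+json": ".jsonld",
    "application/rdf+xml": ".rdf",
    "application/n-triples": ".nt",
}


def plain_mediatype(mediatype: str) -> str:
    """Drop parameters and the 'anot+' marker: text/anot+turtle -> text/turtle."""
    base = mediatype.split(";")[0].strip().lower()
    main, _, sub = base.partition("/")
    if sub.startswith("anot+"):
        sub = sub[len("anot+"):]
    return f"{main}/{sub}"


def file_extension(mediatype: str, default: str = "") -> str:
    """Look up the extension for a mediatype, ignoring any 'anot+' marker."""
    return EXTENSIONS.get(plain_mediatype(mediatype), default)
```

With this, a `Content-Type` of `text/anot+turtle` and one of `text/turtle` both resolve to the same extension.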

Review IDN Demonstration Catalogue content

Reframing of the IDN Demo Catalogue to only include up to 20 entries so that different record types are "demonstrated".

server error

http://idn.kurrawong.net/dataset/NDT/collections/determinations/items/DCD2014-003

This should return the "Brunette Downs" Feature from the NNTT Determinations dataset. Instead:

Internal Server Error

Sorry, something went wrong in server land.

ss 2022-09-30

Capitalise Indigenous throughout

Ensure that all uses of the word "Indigenous" are capitalised

Flag ANU data as indigenous using a variety of techniques.

02-Apr Work is ongoing from this point

26-Mar There is now a formal process in place that makes flagging easier. Some techniques (English words) have been surprisingly useful :) - flagging matches where words are not English (reverse dictionary). Outcome - a list of Indigenous Publications.

Will next search for landmarks/features/place names - particularly old names, e.g. 'Uluru' was 'Ayers Rock'

Need to expand the reference set of data used to flag works as indigenous, using the data sources provided by Adegboyega.
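
The "reverse dictionary" idea and the old-place-name lookup can be sketched as below. The word list is a tiny stand-in for a full English dictionary, the threshold is arbitrary, and the only alias included is the Uluru / Ayers Rock example from the note above; this is not the project's actual flagging code, and flagged items would still need human review.

```python
import re

# Tiny stand-in word list; a real run would load a full English dictionary.
ENGLISH = {"a", "the", "of", "and", "in", "study", "language", "rock", "art", "near"}

# Old name -> current name, seeded with the example given in the issue.
OLD_NAMES = {"ayers rock": "Uluru"}


def non_english_words(title: str, dictionary: set[str] = ENGLISH) -> list[str]:
    """Return the words in title that the dictionary does not recognise."""
    words = re.findall(r"[A-Za-z]+", title)
    return [w for w in words if w.lower() not in dictionary]


def flag(title: str, threshold: int = 1) -> bool:
    """Flag a title for review when enough words are unrecognised."""
    return len(non_english_words(title)) >= threshold


def place_matches(title: str) -> list[str]:
    """Return current place names whose old or current form appears in title."""
    lowered = title.lower()
    return [
        current
        for old, current in OLD_NAMES.items()
        if old in lowered or current.lower() in lowered
    ]
```

Proper nouns such as author surnames would also be flagged by this approach, which is one reason the reference set of Indigenous terms matters as much as the English word list.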

Where formal data dictionaries are publicly available, point at them directly from the catalog?

AGIL is a great dataset to start with. I note it has a formalised (but not machine readable) data dictionary here: https://data.gov.au/data/storage/f/2013-12-02T03:02:16.736Z/agildataset-management-summary-2013-11.pdf which is dated 2013! Those contents really help a potential data user understand in depth what one might be able to do with that dataset.

Eventually I would like to see the catalog capable of pointing directly at such a DD, if one exists, as a kind of "in depth" descriptive resource, and to show the date of that resource (could that be added to our profile as an optional item?).

Risks with broken links using URLs of course (but we can detect that), so perhaps consider mining a copy of those sorts of really core files (there won't be many)?

Contribute ANU researchers to Agents DB

12-Dec-23 Nick will attempt to include this in the workshop tomorrow

Nick will run through the process this afternoon (5/12/23)

Second pass underway

Filter the ARIES list of ANU people for Indigenous researchers

Briscoe Smith Archive POC

28 May - work has started based on the RICO ontology [this is also what the UniMelb are going to be using]

Begin work for ANU on conversion of the Briscoe-Smith archive.

As discussed with Leonard Smith and Sandra Silcot from ANU, there is a piece of work to move the Briscoe-Smith archive from an Access Database to RDF and store it with the rest of the ANU Catalogue using Prez on the NCI infrastructure.

This piece of work can be broken down into three parts.

Metadata modelling
Create a suitable ontological model to support the archive and its needs. Collaboration between KurrawongAI and ANU will be needed to arrive at the destination here.
Conversion
Once an ontological model has been established, convert the data from RDB to RDF in line with the model. Some parsing of unstructured fields may be required to achieve alignment with the desired model.
Metadata Entry form tooling
The Archive will require continued additions from not-yet catalogued items. There are a number of possible solutions available to support this.
1. An adapted version of the Metadata entry tool from the IDN project.
2. VocExcel templates
3. custom built data entry/management portal.

Initially, I (LL) will try to convert a sample of the archive to RDF. A rough metamodel draft can be agreed upon, with refinement to happen later. The idea for this first pass is just to get some data converted and visible in the new system (Prez) so that Len and Sandra can get a feel for the process and how it might play out.

Begin ISU data enquiry in preparation for a catalogue

Review Data Maturity Model and then:
Request five datasets from ISU to categorise in spreadsheet to prototype cataloguing
Work with ISU on spreadsheet if required
Determine what ISU information is required to align with UoM catalogue refresh

Improving ANU data discovery

How best to include DOIs/PIDs to publications derived from a dataset

These publications were derived from research using the KHRD database:

DOI (just a short note; includes a statement that it's under permanent embargo): McCalman, J., Smith, L., Silcot, S., & Kippen, R. (2018). Koori health research database. figshare, https://doi.org/10.4225/03/5a9779c80f529
DOI (substantive research work arising from the database, with a good description of its provenance): http://doi.org/10.1007/s12546-020-09253-x as listed in "Minerva" (Unimelb's research publications repo, here: https://minerva-elements.unimelb.edu.au/viewobject.html?cid=1&id=1517000)

SS to confirm it is appropriate to add this to the catalog entry as per DCAT2 spec (https://www.w3.org/TR/vocab-dcat-2/#examples-dataset-publication):

dct:isReferencedBy <https://doi.org/10.4225/03/5a9779c80f529>;
dct:isReferencedBy <http://doi.org/10.1007/s12546-020-09253-x>;

Check that CatPrez will appropriately render this relationship (does not need to resolve/fetch anything).
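
As a sketch, the two statements above could be generated for any catalogue entry with a small helper like the one below. The dataset IRI is a placeholder and the function name is hypothetical; only the `dct:isReferencedBy` property and the two DOIs come from this issue.

```python
def referenced_by_turtle(dataset_iri: str, publication_iris: list[str]) -> str:
    """Render one dct:isReferencedBy statement per publication as Turtle."""
    if not publication_iris:
        raise ValueError("at least one publication IRI is required")
    lines = ["@prefix dct: <http://purl.org/dc/terms/> .", "", f"<{dataset_iri}>"]
    objects = [f"    dct:isReferencedBy <{iri}>" for iri in publication_iris]
    lines.append(" ;\n".join(objects) + " .")
    return "\n".join(lines)


turtle = referenced_by_turtle(
    "https://example.org/dataset/khrd",  # placeholder IRI for the catalogue entry
    [
        "https://doi.org/10.4225/03/5a9779c80f529",
        "http://doi.org/10.1007/s12546-020-09253-x",
    ],
)
```

The output is plain text, so it can be pasted into a catalogue record or parsed with any RDF library before loading.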

Switch over to Pure extract

can remove old aries extracts.

Document current ANU methodology for catalogue

26-Mar Being added to as required

27-Feb Extract a generalisation of the methodology to apply to research organisations using ANU as the exemplar

Replace License prefLabel acronyms with words

submitted fix in PR # 19 under vocab-data repo

RDF for VocPub profile of a vocab results in a file `all.ttl` but should have a better name

system fix required in prez. lawson to investigate

Improve the next AIATSIS dataset

30 Apr - Need to follow up again as we are running out of FY

16-Apr Follow up with Anthony again.

Following on from the meeting will be suggestions to AIATSIS re AusLang (they need to adopt PIDS) - Thesaurus

Metadata schema for Yumi Sabe could be helpful to begin an IDN - AIATSIS task for discussion.
Could also consider having AIATSIS being the sponsor for the IDC metadata profile.

20-Feb Waiting on feedback from Anthony

Feedback on the PlaceNames gazetteer still pending - potentially ask them what they would like to do next?

Other databases:
https://aiatsis.gov.au/research/guides-and-resources/native-title-resources/native-title-law-database

https://aiatsis.gov.au/third-national-indigenous-languages-survey-online/language-status-map-and-graph-data

https://collection.aiatsis.gov.au/austlang/about and then on data.gov.au https://data.gov.au/data/dataset/austlang-dataset-001

LL 08/01/24
Heard Back from Anthony just before Christmas but no meeting scheduled yet. Hopefully hear from him again this week.

Follow up with NCI regarding the catalogue space

4-Jun Information received; will now seek advice on the next steps. The new PURE extract has around 135 000 records vs 450 000 in the first extract out of ARIES.

28-May Gboyega to really follow up today

21-May Gboyega will reach out today (post Lawson's engagement with NCI) to begin planning next steps.

14-May Robert having issues with CORS requests. Lawson and Jamie to debug today. will check in with Rob to see if anything else is needed after this issue is resolved.

07-May Robert and Lawson are communicating across this. The assumption is that the PoC is up and running. Next step is to migrate from the Kurrawong infrastructure to ANU.
Lawson will chase up with an email to confirm

30 Apr - Need to follow up with Rob again. Lawson has been keeping them up to speed on the technical questions. Lawson thinks Rob has it up and running.

mint PIDs for ANU Departments.

Pull departments from anuthesis.csv in the Author Affiliation column.
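
A minimal sketch of that extraction step, assuming `anuthesis.csv` has a header row with an `Author Affiliation` column as described. The PID namespace and slug scheme are placeholders, and cells holding several affiliations are not split here.

```python
import csv
import io
import re

# Hypothetical PID namespace; the real one is not stated in this issue.
PID_BASE = "https://example.org/pid/department/"


def department_pids(csv_file, column="Author Affiliation"):
    """Map each distinct, non-empty affiliation to a slug-based PID IRI."""
    pids = {}
    for row in csv.DictReader(csv_file):
        name = (row.get(column) or "").strip()
        if name and name not in pids:
            slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
            pids[name] = PID_BASE + slug
    return pids


# Made-up rows standing in for anuthesis.csv.
sample = io.StringIO(
    "Title,Author Affiliation\n"
    "Thesis A,Research School of Physics\n"
    "Thesis B,\n"
    "Thesis C,Research School of Physics\n"
)
pids = department_pids(sample)
```

For the real file, the same function could be called as `department_pids(open("anuthesis.csv", newline="", encoding="utf-8"))`; spelling variants of the same department would still need manual reconciliation before PIDs are minted.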

harvest new collections from oaipmh

As per email from Boyega.

All Metadata in ANU Open Research is open access and harvestable. These are the OAI-PMH details that allow you to harvest the metadata: https://openresearch-repository.anu.edu.au/oai/request?verb=Identify

The Full ANU Research collection contains the following, plus many more collections that may contain some relevant materials:

[NARU](https://openresearch-repository.anu.edu.au/handle/1885/9187) – This is still in progress as we digitise more ANU and NARU publications over the coming year or two
[CAEPR](https://openresearch-repository.anu.edu.au/handle/1885/114085) – This is deposited into directly from CAEPR, so should be fairly up to date with their working papers/reports etc
[NCIS](https://openresearch-repository.anu.edu.au/handle/1885/9491)
[ANU First Nations Portfolio](https://openresearch-repository.anu.edu.au/handle/1885/272442)
[ANU Research publications](https://openresearch-repository.anu.edu.au/handle/1885/26) - will have the most overlap with ARIES as we have a feed from ARIES to this collection.

The Archive and Library Collections section will also have Indigenous materials that may be of interest, including photographs from researchers and the ANU Photography department, ANU Annual Reports and AV material:

[ANU Publications](https://openresearch-repository.anu.edu.au/handle/1885/238435) – digitised ANU publications from the library collection, may also contain NARU material and other relevant historical publications
[ANU Publications: Flood Replacements](https://openresearch-repository.anu.edu.au/handle/1885/207875) - digitised ANU publications from the library collection that were lost during the Chifley floods.