inbo / movepub

R package to prepare animal tracking data from Movebank for publication in a research repository or GBIF

Home Page: https://inbo.github.io/movepub/

License: Other

Language: R (100%)
Topics: oscibio, r, r-package, rstats

movepub's Issues

Provide copy of `read_package()` and `write_package()` from frictionless

Currently, users of movepub have to load the frictionless package to work with datasets (see https://inbo.github.io/movepub/articles/movepub.html).

It might be useful to re-export the functions read_package() and write_package() from frictionless through movepub, cf. how readr::problems() is exposed in camtraptor. That would also avoid the name clash warning between frictionless::add_resource() and movepub::add_resource().
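A minimal sketch of such a re-export, following the standard roxygen2 pattern (the file name and placement are assumptions, not existing movepub code):

```r
# R/reexports.R (hypothetical file): re-export frictionless functions
# so movepub users don't need to attach frictionless themselves.

#' @importFrom frictionless read_package
#' @export
frictionless::read_package

#' @importFrom frictionless write_package
#' @export
frictionless::write_package
```

After documenting, both functions would be callable as movepub::read_package() and movepub::write_package().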

Add intro paragraph to dataset

Add an intro paragraph to the dataset:

This animal tracking dataset is derived from Spanoghe et al. (2022, https://doi.org/10.5281/zenodo.5879096), a deposit of Movebank study ID 1099562810. Data have been standardized to Darwin Core using the movepub R package and are downsampled to the first GPS position per hour. The original dataset description follows.

Note: the write_movebank_dwca() function is the perfect place to edit the EML and write it to the directory too.

Use dplyr for write_dwc()

write_dwc() currently uses SQLite for the transformation. This was chosen so that the transformation itself could be written in SQL, which is more universally understood. Those SQL files are referenced in the function documentation: https://inbo.github.io/movepub/reference/write_dwc.html#data

Annoyingly, this adds two dependencies (RSQLite, DBI), and it is currently affected by a change in glue: #59. Just like in camtraptor/camtrapdp (see inbo/camtraptor#207), I suggest moving the transformation from SQL to dplyr::mutate(). dplyr is already a dependency.
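As a rough illustration of the suggested move, a SQL `SELECT ... AS` mapping becomes a dplyr::mutate() call. The column names follow Movebank conventions, but this simplified mapping is an assumption, not the full write_dwc() transformation:

```r
library(dplyr)

# Hypothetical GPS records with Movebank-style column names
gps <- data.frame(
  `individual-local-identifier` = "ani1",
  `location-lat` = 51.1,
  `location-long` = 3.5,
  check.names = FALSE
)

# SQL like: SELECT gps."location-lat" AS decimalLatitude, ...
# becomes:
occurrence <- gps %>%
  mutate(
    decimalLatitude = `location-lat`,
    decimalLongitude = `location-long`,
    geodeticDatum = "EPSG:4326"
  )
```

This keeps the mapping readable in the R source itself, without the RSQLite/DBI dependencies.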

Todo:

Add alternative labels for `magnetic field raw x` and alike

Provide alternative labels for (official first, alternative second):

Original comment:

These fields are only defined without the mag part in the Movebank attribute dictionary (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000151/), in contrast with e.g.

Get data type from Movebank Attribute Dictionary

In a future update of the Movebank Attribute Dictionary, the data type will be indicated in the definition. Rather than letting create_schema() guess the types, it would be better to retrieve them from the Movebank Attribute Dictionary.

Populate `minimumHeightAboveSurfaceInMeters` from different fields

Comment by @sarahcd:

gps."height-above-ellipsoid" AS minimumDistanceAboveSurfaceInMeters

Some datasets have 'height-above-ellipsoid'. The datum is different, but both fit the definition of minimumDistanceAboveSurfaceInMeters. Is it possible to map conditionally, depending on which attribute is present? Rarely, both attributes are present. There is also 'height-raw', but this does not define units and so should be ignored.
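A conditional mapping along these lines could prefer one attribute and fall back to the other. A base R sketch (which columns are present per dataset, and the preference order, are assumptions):

```r
gps <- data.frame(
  `height-above-msl` = c(12.3, NA),
  `height-above-ellipsoid` = c(NA, 55.1),
  check.names = FALSE
)

# Prefer height-above-msl, fall back to height-above-ellipsoid
gps$minimumDistanceAboveSurfaceInMeters <- ifelse(
  !is.na(gps$`height-above-msl`),
  gps$`height-above-msl`,
  gps$`height-above-ellipsoid`
)
```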

Very imprecise coordinateUncertaintyInMeters

Reviewer of data:

65 of the entries in the O_AMELAND dataset are "999998.9", or 1000 km, and indicate that the coordinates for these entries cannot be trusted. If that cUIM is plausible, those 65 records should be discarded.

My answer:

Before records are uploaded to Movebank (from which they are derived for GBIF), we calculate whether or not they constitute an outlier based on speed and angle (https://github.com/inbo/bird-tracking/blob/master/src/outliers.Rmd#L55), not on (potentially imprecise) height and precision values. We exclude outliers when publishing to GBIF, but we prefer not to exclude records based on the precision.

@sarahcd @tucotuco what would you suggest?

Grouping events

The setup suggested in #7 has a grouping event (deployment) and occurrences (1 tag attachment, 0 or more GPS positions). In line with the new model, @timrobertson100 suggests having a separate event for each occurrence (basically eventID = occurrenceID) and grouping all GPS positions under the deployment using parentEventID. The deployment itself also has an occurrence attached: the tag attachment. See table:

occurrenceID | eventID   | parentEventID | basisOfRecord | eventRemarks
------------ | --------- | ------------- | ------------- | ------------------
ani1_tag1    | ani1_tag1 |               | HumanObs      | deployment remarks
occ1         | occ1      | ani1_tag1     | MachineObs    |
occ2         | occ2      | ani1_tag1     | MachineObs    |

@sarahcd I'm trying this approach to see how it looks at GBIF. Given that the capture and deployment event are one and the same in this model, I would name it tag deployment rather than tag attachment.

Terms like `deploy-on measurements` written with hyphen in NERC

@sarahcd I noticed that the following (new) terms have labels that are written with hyphen:

This in contrast with previous terms:

The extra hyphen makes it annoying to find the term, since we ignore all -, _, and . characters when looking up a term. Would it be possible to change the labels to make them consistent?
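The lookup normalization described above can be sketched as follows (normalize_label() is a hypothetical name, not a movepub function):

```r
# Ignore -, _ and . (and case) when matching a field name to a term label
normalize_label <- function(x) {
  gsub("[-_.]", "", tolower(x))
}

normalize_label("magnetic-field-raw-x") == normalize_label("magnetic_field_raw_x")
# TRUE: both normalize to "magneticfieldrawx"
```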

License not recognized by GBIF

See gbif/ipt#1778

This is recognized by the IPT editor, but not by other systems:

<intellectualRights>
      <para>Public Domain (CC0 1.0)</para>
      <rights>Creative Commons Zero v1.0 Universal</rights>
      <rightsIdentifier>cc0-1.0</rightsIdentifier>
      <rightsIdentifierScheme>SPDX</rightsIdentifierScheme>
      <rightsUri>https://creativecommons.org/publicdomain/zero/1.0/legalcode</rightsUri>
      <schemeUri>https://spdx.org/licenses/</schemeUri>
</intellectualRights>

Handle missing fields

E.g. animal-reproductive-condition can be mapped, but is not present in the source data we have. The function should be able to handle that.

Potential fields:

  • ref.animal-reproductive-condition AS reproductiveCondition
  • gps.comments AS eventRemarks
  • ref.tag-model in eventRemarks
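One way to make the mapping robust to absent columns is a small helper that returns NA when a column is missing (get_col() is a hypothetical name, not part of movepub):

```r
# Return a column if present, otherwise NA, so downstream Darwin Core
# terms like reproductiveCondition are simply left empty
get_col <- function(df, name) {
  if (name %in% names(df)) df[[name]] else NA
}

ref <- data.frame(`animal-id` = "ani1", check.names = FALSE)
get_col(ref, "animal-reproductive-condition")  # NA: column absent
get_col(ref, "animal-id")                      # "ani1"
```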

Exclude empty coordinates

@sarahcd is it possible for the coordinates of the gps records to be NULL or will those be marked as visible=false?

Important point, thanks for catching this! It is possible to have null (empty) coordinate fields that are not marked as outliers. (Sometimes there are other sensor measurements in the record, e.g., battery charge, that the user might want to retain in some contexts, so we don't automatically flag these.) Ideally this would be a preprocessing step, removing these before subsampling records.

Originally posted by @sarahcd in #7 (comment)
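Such a preprocessing step could be a simple filter, sketched here in base R (the column names assume the Movebank GPS data format):

```r
gps <- data.frame(
  `location-lat` = c(51.1, NA),
  `location-long` = c(3.5, NA),
  `battery-charge-percent` = c(80, 79),
  check.names = FALSE
)

# Drop records without coordinates before subsampling
gps <- gps[!is.na(gps$`location-lat`) & !is.na(gps$`location-long`), ]
```

Records that only carry other sensor measurements (e.g. battery charge) would be removed here, which is why this should stay an explicit, opt-in step.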

Sampling protocol

@sarahcd what sampling protocol to use:

  • Deployment start: tag deployment start?
  • GPS positions: gps or gps sensor?
  • Deployment end: tag deployment end?

Records outside of deployment

@sarahcd, in 2017 you remarked:

Published datasets in Movebank often include records (rows in the data file) that do not represent a location where an animal was observed:

Pre- or post-deployment records. For these 'individual-local-identifier' is blank. This will only occur rarely; I think none so far include them but we have a dataset in review with data from some undeployed test tags.

I assume all published datasets exclude points outside a deployment? Or should we filter on those in the SQL?

See how IPT handles `userId` in EML

Expected:

<userId directory="http://orcid.org/">0000-0002-8442-8025</userId>

Current output:

<userId>https://orcid.org/0000-0002-8442-8025</userId>
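One way to produce the expected form is to split the ORCID URL into a `directory` attribute and a bare identifier before writing the EML. A sketch (this is not how movepub currently handles it):

```r
orcid_url <- "https://orcid.org/0000-0002-8442-8025"

# Strip the scheme and host to get the bare identifier;
# the directory attribute is the fixed ORCID resolver
user_id <- sub("^https?://orcid\\.org/", "", orcid_url)
directory <- "http://orcid.org/"
# Target: <userId directory="http://orcid.org/">0000-0002-8442-8025</userId>
```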

Number of decimals in coordinates

Reviewer of data:

There is also minimal correlation between coordinate precision and cUIM. Four entries with 3 decimal places in decimalLatitude have cUIM 2.8, 3.5, 4 and 5.9, although the precision in latitude for 3 decimal places is ±55.5 m (https://en.wikipedia.org/wiki/Decimal_degrees#Precision). (There are 256 decimalLatitude entries with 15 decimal places, e.g. "53.165282500000004". These have an implied uncertainty of ±0.055 nanometers, or roughly half an atomic radius, and should have been rounded.)

My answer:

The decimal places of the coordinates are kept as is, i.e. as recorded in the source database and/or by Movebank. Unfortunately that does include false precision (e.g. trailing zeros being dropped, or more decimals than reasonable), but we prefer to provide the raw value (as available in the source record) rather than rounding or padding with zeros.

@sarahcd @tucotuco what would you suggest? Should we round/extend?

Movebank study IDs are not 32-bit integers

In write_dwc() there is an assertion:

assertthat::assert_that(
    !is.na(as.integer(study_id)),
    msg = glue::glue("`study_id` ({study_id}) must be an integer.")
  )

This fails because Movebank study IDs are not necessarily 32-bit integers. E.g.

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study2391441038

has study ID 2391441038, which does not fit in a 32-bit integer, so as.integer() returns NA with a warning.

This could be replaced with:

assertthat::assert_that(
  grepl("^\\d+$", as.character(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)

Add `study_id` parameter to `write_dwc()`

Indeed, the study URL is pulled from the second alternativeIdentifier in the EML:

study_url <- eml$dataset$alternateIdentifier[[2]]

To allow users to define it, we could:

  • Add a parameter study_id (from which the link can be built)
  • Have users correctly add this when they edit metadata in the IPT

Given that the link is used in the description and as an external link, I think a parameter would be good?

Originally posted by @peterdesmet in #25 (comment)

Make clickable aphia_id in message

write_dwc() currently returns the found aphia_ids:

library(frictionless)
library(movepub)
p <- read_package("https://zenodo.org/records/10055071/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10055071
dwc <- write_dwc(p, directory = ".")
#> 
#> ── Reading data ──
#> 
#> ℹ Taxa found in reference data and their WoRMS AphiaID:
#> Circus cyaneus: 159372
#> Asio flammeus: 212685
#> Circus pygargus: 1037307
#> Circus aeruginosus: 558541
#> Buteo buteo: 558534
#> 
#> ── Transforming data to Darwin Core ──
#> 
#> ── Writing files ──
#> 
#> • 'eml_path'
#> • 'dwc_occurrence_path'

Created on 2023-12-22 with reprex v2.0.2

It would be nice if those aphia_ids were clickable links, which can be done with:

library(cli)
library(movepub)
taxon <- get_aphia_id("Circus cyaneus")
cli::cli_li("{taxon[['name']]}: {.href [{taxon['aphia_id']}]({taxon['aphia_url']})}.")
#> • Circus cyaneus: 159372
#> (<https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372>).

Created on 2023-12-22 with reprex v2.0.2

But it takes some handling of NA and multiple values that I haven't been able to debug.
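A possible way to handle the NA and multiple-value cases is to build the cli markup per taxon first (format_taxon() is a hypothetical helper, not part of movepub):

```r
# Build the cli markup for one taxon, falling back to plain text
# when no AphiaID was found
format_taxon <- function(name, aphia_id, aphia_url) {
  if (is.na(aphia_id)) {
    sprintf("%s: not found", name)
  } else {
    sprintf("%s: {.href [%s](%s)}", name, aphia_id, aphia_url)
  }
}

# Each string can then be passed to cli::cli_li(), e.g.:
# cli::cli_li(format_taxon("Circus cyaneus", 159372,
#   "https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372"))
format_taxon("not_a_name", NA, NA)  # "not_a_name: not found"
```

Looping over the rows of the taxa data frame (rather than vectorizing the cli call) sidesteps the multiple-value issue.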

Create helper functions for `write_dwc()`

write_dwc() is currently too long and is better split into helper functions. This will also make it possible to call certain functions for certain data (GPS vs. bird ring, etc.).

Still in write_dwc()

  • Read resources as dataframes (we don't want to do this multiple times)
  • Get taxa from ref (we don't want to do this multiple times)
  • Binding the occurrence df from the helper functions
  • Adding the DATASET-LEVEL terms

`dwc_occurrence_ref()`

  • File: dwc_occurrence_ref.R
  • Document function, but use @noRd
  • Parameters: ref (dataframe), taxa (dataframe)
  • Applies expand_col() in function, but only with the columns that are used in the mapping
  • Applies dplyr mapping
  • Returns df

`dwc_occurrence_gps()`

  • File: dwc_occurrence_gps.R
  • Document function, but use @noRd
  • Parameters: ref (dataframe), taxa (dataframe), gps (dataframe)
  • Applies expand_col() in function, but only with the columns that are used in the mapping
  • Applies dplyr mapping
  • Returns df

Publish first production dataset

@sarahcd @timrobertson100 I now have an operational function to create Darwin Core from a Movebank dataset: https://inbo.github.io/movepub/reference/write_dwc.html

I have used it to create:

I now want to use this to create the first production dataset, but would like some feedback on:

  • Is there sufficient metadata? See #9
  • Are we ok with the structure of the data? See also last remark at #7
  • What DOI to assign: #15

Include `scientificNameID` to make datasets valid for OBIS

This would require a call to WoRMS, with the worrms package. I see no function to get the LSID directly, so it would be something like:

aphia_id <- worrms::wm_name2id(name = "Larus fuscus")
# Keep scientific_name_id empty if aphia_id is empty
scientific_name_id <- paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":")

Define how to populate metadata

Hi @sarahcd, here's a first attempt at a Movebank dataset on GBIF: https://www.gbif-uat.org/dataset/0ef15f32-b41d-4274-ae96-eb5d0059fee6

Dataset that is included:

  • Basis metadata:
    • Title + [subsampled representation]
    • Language + type + update frequency + publishing org: to be set by user in IPT
    • License: set, but issue #19
    • Description: copied from source dataset, first paragraph added
    • Contact: provided by user or first creator
    • Creators: original dataset authors/creators
    • Metadata provider: same as contact
    • Funding sources: provided as separate paragraph in source dataset and thus copied to IPT
  • Geographic coverage: not set, not directly available in source dataset
  • Taxonomic coverage: could be derived from data, not sure if worth it?
  • Temporal coverage: could be derived from data, not sure if worth it?
  • Keywords: copied from source dataset
  • Associated parties: not set
  • Project data: not set
  • Sampling methods: not set
  • Citations
    • Resource citation: left to automatic one by GBIF
    • Bibliography: not set, could potentially be derived from relatedIdentifiers in DataCite, not sure if worth it? no
  • Collection data: not applicable
  • External links: website set to Movebank Study ID (as a link)
  • Additional metadata:
    • Pub date: set to source dataset but overwritten by IPT
    • Alternative identifier: Movebank Study ID (as a link)
    • Alternative identifier: DOI, but see #15

Populate record level terms from DataCite

  • type
  • license

Could pull from DataCite, e.g. https://api.datacite.org/dois/application/vnd.datacite.datacite+xml/10.5441/001/1.vp4cf4qg:

<rightsList><rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Universal Public Domain Dedication (CC0 1.0)</rights>

Or Movebank REST API, e.g. https://www.movebank.org/movebank/service/direct-read?entity_type=study&study_id=2911040

license_type = CC_0
  • rightsHolder: INBO

Should typically be the institution that collected the data. Not currently represented in DataCite metadata from the repository. In Movebank, I plan to add a study-level attribute that combines institutionID and rightsHolder.

  • datasetID: https://doi.org/10.5281/zenodo.5879096

DOI as a URL, makes sense to always use a PURL and refer to the Movebank study elsewhere

  • institutionCode: MPIAB

institutionCode = 'custody' of the data. Decided to use 'Max Planck Institute of Animal Behavior'.

  • collectionCode: Movebank

Given example 'ebird', can use 'Movebank'.

  • datasetName: O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium) [subsampled representation]

Using a suffix to prevent systems like ORCID from considering the datasets duplicates

  • informationWithheld: see https://doi.org/10.5281/zenodo.5879096 or see metadata

Add tests for `get_aphia_id()`

The helper function get_aphia_id() is currently 1) hidden and 2) untested.

  • It might be good to expose it as a public function: done in 02333a8
  • Add tests, e.g. for:
    • get_aphia_id("Mola mola") Should return 1 record
    • get_aphia_id(c("Mola mola", "not_a_name")) Should return 2 records, one NA
    • get_aphia_id(c("Mola mola", "?")) ? is handled differently by wm_name2id_(), but should return NA
    • get_aphia_id("Pisces") Unaccepted taxa, should return as normal
    • get_aphia_id(c("not_a_name")) Should return NA, but currently has an error

HTML code is lost during conversion to EML

In the EML file, the second paragraph of the metadata does not contain HTML code, so when uploading to the IPT, the links and formatting are not preserved.

The question is: is it worth debugging this, or more time-efficient to do it manually?

Order of data

@sarahcd the row order is strange, e.g., below. If I download the csv from Movebank, the records are not in this order, so that shouldn't be the reason.

[screenshot: unexpected row order in the data]

@sarahcd can you clarify what row order you would expect? First by animal_tag, then timestamp?
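If sorting by animal/tag and then timestamp is indeed the expectation, it could be enforced as a final step. A base R sketch (column names and values are illustrative):

```r
occ <- data.frame(
  eventID = c("ani2_tag2", "ani1_tag1", "ani1_tag1"),
  eventDate = c(
    "2022-01-01T10:00:00Z", "2022-01-01T11:00:00Z", "2022-01-01T10:00:00Z"
  )
)

# Sort by animal/tag first, then by timestamp
occ <- occ[order(occ$eventID, occ$eventDate), ]
```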

Create function to write EML

See https://docs.ropensci.org/EML/articles/creating-EML.html

  • alternateIdentifier: see how GBIF handles this #15
  • title: add suffix
  • creator
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email: don't have this info
    • ORCID: directory not set, see #16
  • metadataProvider
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID
  • pubDate
  • language: not set
  • abstract
    • para(s): HTML correctly escaped
  • keywordSet
    • keyword(s)
  • intellectualRights: not read correctly if URL is provided
  • distribution$online$url
  • contact
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID

Very precise coordinateUncertaintyInMeters

Reviewer of the data:

Please consider recalculating or re-estimating the entries in the coordinateUncertaintyInMeters field. I would have expected that a default uncertainty could be provided for a GPS tracking unit (or a handheld GPS). Instead, the entries are hugely varied and given to 0.1 m, which (again) is unbelievable.

My answer:

Values in this field are based on what is recorded as horizontal precision by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off (e.g. the default 30m as lower value for GPS).

@sarahcd @tucotuco what would you suggest?

Imprecise values in minimumDistanceAboveSurfaceInMeters

Reviewer of data:

Please consider deleting the minimumDistanceAboveSurfaceInMeters field. In the O_AMELAND dataset the entries range from -405 (depth) to 6467 (elevation), both of which are unbelievable and make the other entries suspicious.

My answer:

Values in this field are based on what is recorded as height by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off, which can vary from project to project.

@sarahcd @tucotuco what would you suggest?

DwC mapping of Movebank

Record-level

See #12

Deployment

  • basisOfRecord
  • occurrenceID: don't use deploymentID (not always assigned), use same value as dwc:eventID
  • individualCount: add? no
  • sex: unknown -> undetermined left as unknown
  • lifeStage: map as is? yes
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: use ref."animal-id" || '_' || ref."tag-id" (different order, not a hyphen)
  • samplingProtocol: tag deployment? use tag attachment
  • eventDate
  • eventRemarks: yes, use this since it will show on GBIF. Also include more information:
    ref."tag-manufacturer-name" || ' ' || ref."tag-model" || ' tag attached to ' ||
    CASE
      WHEN ref."manipulation-type" = 'none' THEN 'free-ranging animal'
      ELSE 'manipulated animal (see original dataset)' -- Or exclude these animals from GBIF.
    END
    || ' by ' || ref."attachment-type"
  • minimumDistanceAboveSurfaceInMeters: NULL
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum Make conditional on decimalLatitude
  • coordinateUncertaintyInMeters: 30? Use 1000, make conditional on decimalLatitude
  • taxonID: available? Yes, in database (ITIS TSN), but would have to be retrieved through https://www.movebank.org/movebank/service/direct-read?entity_type=taxon&canonical_name=Phoebastria%20irrorata). However, since scientificNames are already standardized, it's not really worth it
  • scientificName
  • kingdom: set to Animalia yes
  • rank: available? almost always species, but subspecies could happen too
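The eventID construction above, ported from the SQL expression to R (the sample values are hypothetical):

```r
ref <- data.frame(
  `animal-id` = "ani1",
  `tag-id` = "tag1",
  check.names = FALSE
)

# ref."animal-id" || '_' || ref."tag-id"
eventID <- paste(ref$`animal-id`, ref$`tag-id`, sep = "_")
eventID  # "ani1_tag1"
```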

GPS

  • basisOfRecord
  • dataGeneralizations: e.g. `subsampled by hour: first of x records`? ok
  • occurrenceID: prefix with something, e.g. movebank:00000? use identifier as is
  • sex: NULL or map? Now mapped
  • lifeStage: NULL, see #11
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: update
  • samplingProtocol: gps."sensor-type"? yes
  • eventDate
  • eventRemarks: NULL
  • minimumDistanceAboveSurfaceInMeters: height-above-msl
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum
  • coordinateUncertaintyInMeters
  • scientificName
  • kingdom: set to Animalia yes

Empty fields

Reviewer of data:

The organismName and reproductiveCondition fields are empty and should be deleted.

My answer:

We developed a standardized approach for transforming data from a Movebank study to Darwin Core data (https://inbo.github.io/movepub/reference/write_dwc.html). This approach includes providing fields that might not be available/populated in the source data, such as organismName and reproductiveCondition. We could check and remove empty fields manually for each dataset, but we prefer to use the same (automated) approach for all datasets.

@sarahcd @tucotuco what would you suggest?

Write tests for `write_dwc()`

Similar to camtraptor, this package could benefit from snapshot tests for write_dwc(). In contrast with camtraptor, write_dwc() writes both an eml.xml and a dwc_occurrence.csv file, and there is no example dataset included in the package. You can use this dataset though:

o_assen <-
  frictionless::read_package("https://zenodo.org/record/5653311/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  # write_package() requires a directory to write to
  frictionless::write_package(directory = ".")

It can potentially be reduced to fewer animals.

  • Write helper function to download example dataset (so it can be used locally)
    • Potentially reduce reference-data and gps data to a set of 3-4 animal-ids to reduce size
    • Downloaded files should be removed after tests
    • If the function is called directly from the tests (rather than assigned to a variable at the beginning of the tests), it should check if the local files are available?
  • Test expected Darwin Core mapping
  • Test expected EML file (note that there is a UUID that is expected to change)

@PietrH would you be willing to tackle this?

What reference data to include

@sarahcd so far I have based the mapping for the start and end HumanObservations on fields I have in my datasets. It would be good, however, to have a mapping based on all fields potentially available in the reference data. Can you check the fields, confirm those that are not useful to map (crossed out), and comment on those you think would be useful to include?

  • animal-comments: start occurrenceRemarks?
  • animal-death-comments: end
  • animal-exact-date-of-birth
  • animal-id: organismID
  • animal-latest-date-born
  • animal-life-stage: lifeStage
  • animal-mass
  • animal-nickname: organismName
  • animal-reproductive-condition: reproductiveCondition
  • animal-ring-id: include, where?
  • animal-sex: sex
  • animal-taxon: scientificName
  • animal-taxon-detail
  • attachment-type: included in start eventRemarks
  • behavior-according-to
  • data-processing-software
  • deploy-off-date: end
  • deploy-off-latitude: end
  • deploy-off-longitude: end
  • deploy-off-person: end
  • deploy-on-date: start eventDate
  • deploy-on-latitude: start decimalLatitude
  • deploy-on-longitude: start decimalLongitude
  • deploy-on-person: start recordedBy?
  • deployment-comments: included in start eventRemarks
  • deployment-end-comments: end
  • deployment-end-type: end
  • deployment-id
  • duty-cycle
  • geolocator-calibration
  • geolocator-light-threshold
  • geolocator-sensor-comments
  • geolocator-sun-elevation-angle
  • habitat-according-to
  • location-accuracy-comments
  • manipulation-comments
  • manipulation-type: included in start eventRemarks
  • study-site
  • tag-beacon-frequency
  • tag-comments
  • tag-failure-comments
  • tag-id: part of eventID
  • tag-manufacturer-name: included in start eventRemarks
  • tag-mass
  • tag-model: included in start eventRemarks
  • tag-processing-type
  • tag-production-date
  • tag-readout-method
  • tag-serial-no
