inbo / movepub

R package to prepare animal tracking data from Movebank for publication in a research repository or GBIF

Home Page: https://inbo.github.io/movepub/

License: Other

Languages: R 100.00%
Topics: oscibio, r, r-package, rstats

movepub's Introduction

movepub


Movepub is an R package to prepare animal tracking data from Movebank for publication in a research repository or the Global Biodiversity Information Facility (GBIF).

To get started, see:

  • Prepare your study: guidelines on how data owners should prepare their study in Movebank for archiving, prior to using this package.
  • Get started: an introduction to the package’s main functionalities.
  • Function reference: overview of all functions.

Note that Movebank users retain ownership of their data, and use should follow the general Movebank terms of use and any other license terms set by the owner.

Installation

You can install the development version of movepub from GitHub with:

# install.packages("devtools")
devtools::install_github("inbo/movepub")

Usage

This package supports two use cases:

Meta

  • We welcome contributions including bug reports.
  • License: MIT
  • Get citation information for movepub in R by running citation("movepub").
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

movepub's People

Contributors: peterdesmet, pietrh, sarahcd

movepub's Issues

Populate record level terms from DataCite

  • type
  • license

Could pull from DataCite, e.g. https://api.datacite.org/dois/application/vnd.datacite.datacite+xml/10.5441/001/1.vp4cf4qg:

<rightsList>
  <rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Universal Public Domain Dedication (CC0 1.0)</rights>
</rightsList>

Or Movebank REST API, e.g. https://www.movebank.org/movebank/service/direct-read?entity_type=study&study_id=2911040

license_type = CC_0
  • rightsHolder: INBO

Should typically be the institution that collected the data. Not currently represented in DataCite metadata from the repository. In Movebank, I plan to add a study-level attribute that combines institutionID and rightsHolder.

  • datasetID: https://doi.org/10.5281/zenodo.5879096

The DOI as a URL; it makes sense to always use a PURL and to refer to the Movebank study elsewhere.

  • institutionCode: MPIAB

institutionCode = the 'custody' of the data. Decided to use 'Max Planck Institute of Animal Behavior'.

  • collectionCode: Movebank

Given the example 'ebird', we can use 'Movebank'.

  • datasetName: O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium) [subsampled representation]

Using a suffix to avoid systems like ORCID considering the datasets duplicates.

  • informationWithheld: see https://doi.org/10.5281/zenodo.5879096 or see metadata
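Pulling type and license could look something like the sketch below, which queries the DataCite REST API (JSON flavour) for the example DOI above. The field paths under data$attributes are my assumption of how jsonlite simplifies the response; verify against a live response before relying on them.

```r
# Sketch: fetch rights metadata for a DOI from the DataCite REST API.
# Field paths (data$attributes$rightsList) are assumed, not verified.
library(jsonlite)

doi <- "10.5441/001/1.vp4cf4qg"
meta <- fromJSON(paste0("https://api.datacite.org/dois/", doi))
rights <- meta$data$attributes$rightsList  # typically a data frame
rights$rightsUri                           # license URI(s), e.g. the CC0 URL
```

The same idea would work for the Movebank REST API (license_type = CC_0), at the cost of a second HTTP dependency.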

Imprecise values in minimumDistanceAboveSurfaceInMeters

Reviewer of data:

Please consider deleting the minimumDistanceAboveSurfaceInMeters field. In the O_AMELAND dataset the entries
range from -405 (depth) to 6467 (elevation), both of which are unbelievable and make the other entries suspicious.

My answer:

Values in this field are based on what is recorded as height by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off, which can vary from project to project.

@sarahcd @tucotuco what would you suggest?

Order of data

@sarahcd the row order is strange, e.g., below. If I download the csv from Movebank, the records are not in this order, so that shouldn't be the reason.

[screenshot of the unexpected row order omitted]

@sarahcd can you clarify what row order you would expect? First by animal_tag, then timestamp?

Define how to populate metadata

Hi @sarahcd, here's a first attempt at a Movebank dataset on GBIF: https://www.gbif-uat.org/dataset/0ef15f32-b41d-4274-ae96-eb5d0059fee6

Dataset that is included:

  • Basis metadata:
    • Title + [subsampled representation]
    • Language + type + update frequency + publishing org: to be set by user in IPT
    • License: set, but issue #19
    • Description: copied from source dataset, first paragraph added
    • Contact: provided by user or first creator
    • Creators: original dataset authors/creators
    • Metadata provider: same as contact
    • Funding sources: provided as separate paragraph in source dataset and thus copied to IPT
  • Geographic coverage: not set, not directly available in source dataset
  • Taxonomic coverage: could be derived from data, not sure if worth it?
  • Temporal coverage: could be derived from data, not sure if worth it?
  • Keywords: copied from source dataset
  • Associated parties: not set
  • Project data: not set
  • Sampling methods: not set
  • Citations
    • Resource citation: left to automatic one by GBIF
    • Bibliography: not set, could potentially be derived from relatedIdentifiers in DataCite, not sure if worth it? no
  • Collection data: not applicable
  • External links: website set to Movebank Study ID (as a link)
  • Additional metadata:
    • Pub date: set to source dataset but overwritten by IPT
    • Alternative identifier: Movebank Study ID (as a link)
    • Alternative identifier: DOI, but see #15

Populate `minimumHeightAboveSurfaceInMeters` from different fields

Comment by @sarahcd:

gps."height-above-ellipsoid" AS minimumDistanceAboveSurfaceInMeters
Some datasets have 'height-above-ellipsoid'. The datum is different but both fit the definition of minimumDistanceAboveSurfaceInMeters.
Is it possible to map conditionally depending on which attribute is present?
Rarely, both attributes are present. And there is also 'height-raw', but this does not define units so should be ignored.
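A conditional mapping could be sketched with dplyr::coalesce(), which takes the first non-missing value per record. The column names below follow Movebank conventions but are assumptions about the parsed data frame:

```r
# Sketch: prefer "height-above-msl", fall back to "height-above-ellipsoid"
# when only that attribute is present (column names assumed).
library(dplyr)

gps <- gps %>%
  mutate(
    minimumDistanceAboveSurfaceInMeters = coalesce(
      .data[["height-above-msl"]],
      .data[["height-above-ellipsoid"]]
    )
  )
```

When both attributes are present, this silently prefers the first one; if the datum difference matters, an explicit dataGeneralizations note may be warranted.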

Add alternative labels for `magnetic field raw x` and alike

Provide alternative labels for (official first, alternative second):

Original comment:

These fields are only defined without the mag part in the Movebank attribute dictionary (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000151/), in contrast with e.g.

Exclude empty coordinates

@sarahcd is it possible for the coordinates of the gps records to be NULL or will those be marked as visible=false?

Important point, thanks for catching this! It is possible to have null (empty) coordinate fields that are not marked as outliers. (Sometimes there are other sensor measurements in the record, e.g., battery charge, that the user might want to retain in some contexts, so we don't automatically flag these.) Ideally this would be a preprocessing step, removing these before subsampling records.

Originally posted by @sarahcd in #7 (comment)
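As a preprocessing step, dropping records with empty coordinates before subsampling could look like this sketch (Movebank's "location-long"/"location-lat" column names are assumed):

```r
# Sketch: remove records without coordinates before subsampling,
# so they don't end up as empty occurrences (column names assumed).
library(dplyr)

gps <- gps %>%
  filter(
    !is.na(.data[["location-long"]]),
    !is.na(.data[["location-lat"]])
  )
```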

Add `study_id` parameter to `write_dwc()`

Indeed, the study URL is pulled from the second alternativeIdentifier in the EML:

study_url <- eml$dataset$alternateIdentifier[[2]]

To allow users to define it, we could:

  • Add a parameter study_id (from which the link can be built)
  • Have users add/correct this when they edit metadata in the IPT

Given that the link is used in the description and as external link, I think a parameter would be good?

Originally posted by @peterdesmet in #25 (comment)
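Building the link from a study_id parameter could be as simple as the sketch below, using the URL pattern of Movebank study pages (the example id is taken from the REST API example earlier in this page):

```r
# Sketch: build a Movebank study URL from a study_id parameter.
study_id <- "2911040"  # example id, for illustration only
study_url <- paste0(
  "https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study",
  study_id
)
```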

Publish first production dataset

@sarahcd @timrobertson100 I now have an operational function to create Darwin Core from a Movebank dataset: https://inbo.github.io/movepub/reference/write_dwc.html

I have used it to create:

I now want to use this to create the first production dataset, but would like some feedback on:

  • Is there sufficient metadata? See #9
  • Are we ok with the structure of the data? See also last remark at #7
  • What DOI to assign: #15

Make clickable aphia_id in message

write_dwc() currently returns the found aphia_ids:

library(frictionless)
library(movepub)
p <- read_package("https://zenodo.org/records/10055071/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10055071
dwc <- write_dwc(p, directory = ".")
#> 
#> ── Reading data ──
#> 
#> ℹ Taxa found in reference data and their WoRMS AphiaID:
#> Circus cyaneus: 159372
#> Asio flammeus: 212685
#> Circus pygargus: 1037307
#> Circus aeruginosus: 558541
#> Buteo buteo: 558534
#> 
#> ── Transforming data to Darwin Core ──
#> 
#> ── Writing files ──
#> 
#> • 'eml_path'
#> • 'dwc_occurrence_path'

Created on 2023-12-22 with reprex v2.0.2

It would be nice if those aphia_ids were clickable links, which can be done with:

library(cli)
library(movepub)
taxon <- get_aphia_id("Circus cyaneus")
cli::cli_li("{taxon[['name']]}: {.href [{taxon['aphia_id']}]({taxon['aphia_url']})}.")
#> • Circus cyaneus: 159372
#> (<https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372>).

Created on 2023-12-22 with reprex v2.0.2

But it takes some handling of NA and multiple values that I haven't been able to debug.
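A possible way around that is to loop over the rows and branch on NA, assuming get_aphia_id() returns a data frame with name, aphia_id and aphia_url columns (as the snippet above suggests):

```r
# Sketch: NA-safe clickable AphiaID links. Assumes get_aphia_id() returns a
# data frame with columns name, aphia_id and aphia_url (one row per taxon).
library(cli)
library(movepub)

taxa <- get_aphia_id(c("Circus cyaneus", "not_a_name"))
for (i in seq_len(nrow(taxa))) {
  if (is.na(taxa$aphia_id[i])) {
    cli_li("{taxa$name[i]}: no AphiaID found")
  } else {
    cli_li("{taxa$name[i]}: {.href [{taxa$aphia_id[i]}]({taxa$aphia_url[i]})}")
  }
}
```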

Add intro paragraph to dataset

Add an intro paragraph to the dataset:

This animal tracking dataset is derived from Spanoghe et al. (2022, https://doi.org/10.5281/zenodo.5879096), a deposit of Movebank study ID 1099562810. Data have been standardized to Darwin Core using the movepub R package and are downsampled to the first GPS position per hour. The original dataset description follows.

Note: The write_movebank_dwca() function is the perfect place to edit the EML and write it to the directory too.

Number of decimals in coordinates

Reviewer of data:

There is also minimal correlation between coordinate precision and cUIM. Four entries with 3 decimal places in
decimalLatitude have cUIM 2.8, 3.5, 4 and 5.9, although the precision in latitude for 3 decimal places is ±55.5 m
(https://en.wikipedia.org/wiki/Decimal_degrees#Precision).
(There are 256 decimalLatitude entries with 15 decimal places, e.g. "53.165282500000004". These have an implied
uncertainty of ±0.055 nanometers, or roughly half an atomic radius, and should have been rounded.)

My answer:

The decimal places of the coordinates are as is, i.e. how these are recorded in the source database and/or recorded by Movebank. Unfortunately that does include false precision (e.g. trailing 000 being dropped or more decimals than reasonable), but we prefer to provide the raw value (as available in the source record), rather than rounding/extending zeros.

@sarahcd @tucotuco what would you suggest? Should we round/extend?

Grouping events

The setup suggested in #7 has a grouping event (deployment) and occurrences (1 tag attachment, 0 or more GPS positions). In line with the new model, @timrobertson100 suggests having a separate event for every occurrence (basically eventID = occurrenceID) and grouping all GPS positions under the deployment using parentEventID. The deployment itself also has an occurrence attached: the tag attachment. See table:

occurrenceID | eventID   | parentEventID | basisOfRecord | eventRemarks
------------ | --------- | ------------- | ------------- | ------------------
ani1_tag1    | ani1_tag1 |               | HumanObs      | deployment remarks
occ1         | occ1      | ani1_tag1     | MachineObs    |
occ2         | occ2      | ani1_tag1     | MachineObs    |

@sarahcd I'm trying this approach to see how it looks at GBIF. Given that the capture and deployment event are one and the same in this model, I would name it tag deployment rather than tag attachment.

Records outside of deployment

@sarahcd, in 2017 you remarked:

Published datasets in Movebank often include records (rows in the data file) that do not represent a location where an animal was observed:

Pre- or post-deployment records. For these 'individual-local-identifier' is blank. This will only occur rarely; I think none so far include them but we have a dataset in review with data from some undeployed test tags.

I assume all published datasets exclude points outside a deployment? Or should we filter on those in the SQL?

What reference data to include

@sarahcd so far I have based the mapping for the start and end HumanObservations on fields I have in my datasets. It would be good however to have a mapping based on all fields potentially available in the reference data. Can you check the fields where you agree that they are not useful to map (crossed out) and comment on those you think would be useful to include?

  • animal-comments: start occurrenceRemarks?
  • animal-death-comments: end
  • animal-exact-date-of-birth
  • animal-id: organismID
  • animal-latest-date-born
  • animal-life-stage: lifeStage
  • animal-mass
  • animal-nickname: organismName
  • animal-reproductive-condition: reproductiveCondition
  • animal-ring-id: include, where?
  • animal-sex: sex
  • animal-taxon: scientificName
  • animal-taxon-detail
  • attachment-type: included in start eventRemarks
  • behavior-according-to
  • data-processing-software
  • deploy-off-date: end
  • deploy-off-latitude: end
  • deploy-off-longitude: end
  • deploy-off-person: end
  • deploy-on-date: start eventDate
  • deploy-on-latitude: start decimalLatitude
  • deploy-on-longitude: start decimalLongitude
  • deploy-on-person: start recordedBy?
  • deployment-comments: included in start eventRemarks
  • deployment-end-comments: end
  • deployment-end-type: end
  • deployment-id
  • duty-cycle
  • geolocator-calibration
  • geolocator-light-threshold
  • geolocator-sensor-comments
  • geolocator-sun-elevation-angle
  • habitat-according-to
  • location-accuracy-comments
  • manipulation-comments
  • manipulation-type: included in start eventRemarks
  • study-site
  • tag-beacon-frequency
  • tag-comments
  • tag-failure-comments
  • tag-id: part of eventID
  • tag-manufacturer-name: included in start eventRemarks
  • tag-mass
  • tag-model: included in start eventRemarks
  • tag-processing-type
  • tag-production-date
  • tag-readout-method
  • tag-serial-no

Movebank study ids are not 32 bit integers

In write_dwc() there is an assertion:

assertthat::assert_that(
  !is.na(as.integer(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)

This is an error because Movebank study ids are not necessarily 32-bit integers, e.g.

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study2391441038

but 2391441038 is not a 32 bit integer.

Could possibly replace with

assertthat::assert_that(
  regexpr("^\\d+$", as.character(study_id)) == 1,
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)
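An equivalent check with grepl() reads slightly more idiomatically (a sketch of the same idea, not a tested replacement):

```r
# Sketch: string-based integer check that avoids 32-bit overflow.
assertthat::assert_that(
  grepl("^[0-9]+$", as.character(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)
```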

Create function to write EML

See https://docs.ropensci.org/EML/articles/creating-EML.html

  • alternateIdentifier: see how GBIF handles this #15
  • title: add suffix
  • creator
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email: don't have this info
    • ORCID: directory not set, see #16
  • metadataProvider
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID
  • pubDate
  • language: not set
  • abstract
    • para(s): HTML correctly escaped
  • keywordSet
    • keyword(s)
  • intellectualRights: not read correctly if URL is provided
  • distribution$online$url
  • contact
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID

Add tests for `get_aphia_id()`

The helper function get_aphia_id() is currently 1) hidden and 2) untested.

  • It might be good to expose it as a public function: done in 02333a8
  • Add tests, e.g. for:
    • get_aphia_id("Mola mola") Should return 1 record
    • get_aphia_id(c("Mola mola", "not_a_name")) Should return 2 records, one NA
    • get_aphia_id(c("Mola mola", "?")) ? is handled differently by wm_name2id_(), but should return NA
    • get_aphia_id("Pisces") Unaccepted taxa, should return as normal
    • get_aphia_id(c("not_a_name")) Should return NA, but currently has an error

Very imprecise coordinateUncertaintyInMeters

Reviewer of data:

65 of the entries in the O_AMELAND dataset are "999998.9", or 1000 km, and indicate that the coordinates for these
entries cannot be trusted. If that cUIM is plausible, those 65 records should be discarded.

My answer:

Before records are uploaded to Movebank (from which they are derived for GBIF) we calculate whether or not they constitute an outlier based on speed and angle (https://github.com/inbo/bird-tracking/blob/master/src/outliers.Rmd#L55), not on (potentially imprecise) height and precision values. We exclude outliers when publishing to GBIF; we prefer not to exclude records based on the precision.

@sarahcd @tucotuco what would you suggest?

Empty fields

Reviewer of data:

The organismName and reproductiveCondition fields are empty and should be deleted.

My answer:

We developed a standardized approach for transforming data from a Movebank study to Darwin Core data (https://inbo.github.io/movepub/reference/write_dwc.html). This approach includes providing fields that might not be available/populated in the source data, such as organismName and reproductiveCondition. We could check and remove empty fields manually for each dataset, but we prefer to use the same (automated) approach for all datasets.

@sarahcd @tucotuco what would you suggest?

Handle missing fields

E.g. animal-reproductive-condition can be mapped, but is not in the source data we have. The function should be able to handle that.

Potential fields:

  • ref.animal-reproductive-condition AS reproductiveCondition
  • gps.comments AS eventRemarks
  • ref.tag-model in eventRemarks

Terms like `deploy-on measurements` written with hyphen in NERC

@sarahcd I noticed that the following (new) terms have labels that are written with hyphen:

This in contrast with previous terms:

The extra hyphen makes it annoying to find the term, since we ignore all -, _, and . to look up a term. Is it possible to change the labels to make them consistent?

See how IPT handles `userId` in EML

Expected:

<userId directory="http://orcid.org/">0000-0002-8442-8025</userId>

Current output:

<userId>https://orcid.org/0000-0002-8442-8025</userId>

Write tests for `write_dwc()`

Similar to camtraptor, this package could benefit from snapshot tests for write_dwc(). In contrast with camtraptor, write_dwc() writes both an eml.xml and dwc_occurrence.csv file. And there is no example dataset included in the package. You can use this dataset though:

o_assen <-
  frictionless::read_package("https://zenodo.org/record/5653311/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  # write_package() requires a directory to write to
  frictionless::write_package("o_assen")

It can potentially be reduced to fewer animals.

  • Write helper function to download example dataset (so it can be used locally)
    • Potentially reduce reference-data and gps data to a set of 3-4 animal-ids to reduce size
    • Downloaded files should be removed after tests
    • If the function is called directly from tests (rather than assigned to a variable at the beginning of the tests), it should check if the local files are available?
  • Test expected Darwin Core mapping
  • Test expected EML file (note that there is a UUID that is expected to change)

@PietrH would you be willing to tackle this?

Sampling protocol

@sarahcd what sampling protocol to use:

  • Deployment start: tag deployment start?
  • GPS positions: gps or gps sensor?
  • Deployment end: tag deployment end?

Split `write_dwc()` into `write_dwc()` and `write_eml()`

Ideally, write_dwc() can be used in an operational pipeline and doesn't generate errors. Currently, that is not the case, since it requires e.g. a rightsholder, contact person, etc. Those properties are actually required for the EML part, not for the Darwin Core transformation.

Get data type from Movebank Attribute Dictionary

In a future update of the Movebank Attribute Dictionary, data type will be indicated in the definition. Rather than letting create_schema() guess the types, it would be better if those are retrieved from the MVB Dictionary.

Provide copy of `read_package()` and `write_package()` from frictionless

Currently, users of movepub have to load the frictionless package to work with datasets (see https://inbo.github.io/movepub/articles/movepub.html).

It might be useful if the functions read_package() and write_package() are copied from frictionless and exposed through movepub, cf. how readr::problems() is exposed in camtraptor. That also avoids the name clash warning between frictionless::add_resource() and movepub::add_resource().
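In movepub's R/ code, the roxygen2 re-export pattern could look like this sketch:

```r
# Sketch: re-export frictionless functions through movepub,
# using the standard roxygen2 re-export pattern.

#' @importFrom frictionless read_package
#' @export
frictionless::read_package

#' @importFrom frictionless write_package
#' @export
frictionless::write_package
```

This documents the functions as re-exports on movepub's reference page while the implementation stays in frictionless.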

DwC mapping of Movebank

Record-level

See #12

Deployment

  • basisOfRecord
  • occurrenceID: don't use deploymentID (not always assigned), use same value as dwc:eventID
  • individualCount: add? no
  • sex: unknown -> undetermined? Left as unknown
  • lifeStage: map as is? yes
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: use ref."animal-id" || '_' || ref."tag-id" (different order, not a hyphen)
  • samplingProtocol: tag deployment? use tag attachment
  • eventDate
  • eventRemarks: yes, use this since it will show on GBIF. Also include more information:
ref."tag-manufacturer-name" || ' ' || ref."tag-model" || ' tag attached to ' ||
CASE
  WHEN ref."manipulation-type" = 'none' THEN 'free-ranging animal'
  WHEN ref."manipulation-type" != 'none' THEN 'manipulated animal (see original dataset)' -- Or exclude these animals from GBIF.
END
|| ' by ' || ref."attachment-type"
  • minimumDistanceAboveSurfaceInMeters: NULL
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum Make conditional on decimalLatitude
  • coordinateUncertaintyInMeters: 30? Use 1000, make conditional on decimalLatitude
  • taxonID: available? Yes, in the database (ITIS TSN), but it would have to be retrieved through https://www.movebank.org/movebank/service/direct-read?entity_type=taxon&canonical_name=Phoebastria%20irrorata. However, since scientificNames are already standardized, it's not really worth it
  • scientificName
  • kingdom: set to Animalia yes
  • rank: available? almost always species, but subspecies could happen too

GPS

  • basisOfRecord
  • dataGeneralizations: e.g. 'subsampled by hour: first of x records'? ok
  • occurrenceID: prefix with something? movebank:00000 use identifier as is
  • sex: NULL or map? Now mapped
  • lifeStage: NULL, see #11
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: update
  • samplingProtocol: gps."sensor-type"? yes
  • eventDate
  • eventRemarks: NULL
  • minimumDistanceAboveSurfaceInMeters: height-above-msl
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum
  • coordinateUncertaintyInMeters
  • scientificName
  • kingdom: set to Animalia yes

Include `scientificNameID` to make datasets valid for OBIS

This would require a call to WoRMS, with the worrms package. I see no function to get the LSID directly, so it would be something like:

aphia_id <- worrms::wm_name2id(name = "Larus fuscus")
# Keep scientificNameID empty if no AphiaID was found
scientific_name_id <- ifelse(
  is.na(aphia_id),
  NA_character_,
  paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":")
)

Very precise coordinateUncertaintyInMeters

Reviewer of the data:

Please consider recalculating or re-estimating the entries in the coordinateUncertaintyInMeters field. I would have
expected that a default uncertainty could be provided for a GPS tracking unit (or a handheld GPS). Instead, the entries
are hugely varied and given to 0.1 m, which (again) is unbelievable.

My answer:

Values in this field are based on what is recorded as horizontal precision by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off (e.g. the default 30m as lower value for GPS).

@sarahcd @tucotuco what would you suggest?

License not recognized by GBIF

See gbif/ipt#1778

This is recognized by the IPT editor, but not by other systems:

<intellectualRights>
      <para>Public Domain (CC0 1.0)</para>
      <rights>Creative Commons Zero v1.0 Universal</rights>
      <rightsIdentifier>cc0-1.0</rightsIdentifier>
      <rightsIdentifierScheme>SPDX</rightsIdentifierScheme>
      <rightsUri>https://creativecommons.org/publicdomain/zero/1.0/legalcode</rightsUri>
      <schemeUri>https://spdx.org/licenses/</schemeUri>
</intellectualRights>
