inbo / movepub

R package to prepare animal tracking data from Movebank for publication in a research repository or GBIF

Home Page: https://inbo.github.io/movepub/

License: Other

Languages: R 100.00%
Topics: oscibio, r, r-package, rstats

movepub's Introduction

movepub


Movepub is an R package to prepare animal tracking data from Movebank for publication in a research repository or the Global Biodiversity Information Facility (GBIF).

To get started, see:

  • Prepare your study: guidelines on how data owners should prepare their study in Movebank for archiving, prior to using this package.
  • Get started: an introduction to the package’s main functionalities.
  • Function reference: overview of all functions.

Note that Movebank users retain ownership of their data, and use should follow the general Movebank terms of use and any other license terms set by the owner.

Installation

You can install the development version of movepub from GitHub with:

# install.packages("devtools")
devtools::install_github("inbo/movepub")

Usage

This package supports two use cases:

Meta

  • We welcome contributions including bug reports.
  • License: MIT
  • Get citation information for movepub in R by running citation("movepub").
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

movepub's People

Contributors: peterdesmet, pietrh, sarahcd

movepub's Issues

Populate record level terms from DataCite

  • type
  • license

Could pull from DataCite, e.g. https://api.datacite.org/dois/application/vnd.datacite.datacite+xml/10.5441/001/1.vp4cf4qg:

<rightsList>
  <rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Universal Public Domain Dedication (CC0 1.0)</rights>
</rightsList>

Or Movebank REST API, e.g. https://www.movebank.org/movebank/service/direct-read?entity_type=study&study_id=2911040

license_type = CC_0
  • rightsHolder: INBO

Should typically be the institution that collected the data. Not currently represented in DataCite metadata from the repository. In Movebank, I plan to add a study-level attribute that combines institutionID and rightsHolder.

  • datasetID: https://doi.org/10.5281/zenodo.5879096

The DOI as a URL; it makes sense to always use a PURL and to refer to the Movebank study elsewhere.

  • institutionCode: MPIAB

institutionCode = the 'custody' of the data. Decided to use 'Max Planck Institute of Animal Behavior'.

  • collectionCode: Movebank

Given the example 'ebird', we can use 'Movebank'.

  • datasetName: O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium) [subsampled representation]

Using a suffix to avoid systems like ORCID considering the datasets duplicates.

  • informationWithheld: see https://doi.org/10.5281/zenodo.5879096 or see metadata
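Pulling type and license could look something like the sketch below, which queries the DataCite REST API (JSON flavour) for the example DOI above. The field paths under data$attributes are my assumption of how jsonlite simplifies the response; verify against a live response before relying on them.

```r
# Sketch: fetch rights metadata for a DOI from the DataCite REST API.
# Field paths (data$attributes$rightsList) are assumed, not verified.
library(jsonlite)

doi <- "10.5441/001/1.vp4cf4qg"
meta <- fromJSON(paste0("https://api.datacite.org/dois/", doi))
rights <- meta$data$attributes$rightsList  # typically a data frame
rights$rightsUri                           # license URI(s), e.g. the CC0 URL
```

The same idea would work for the Movebank REST API (license_type = CC_0), at the cost of a second HTTP dependency.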

Imprecise values in minimumDistanceAboveSurfaceInMeters

Reviewer of data:

Please consider deleting the minimumDistanceAboveSurfaceInMeters field. In the O_AMELAND dataset the entries
range from -405 (depth) to 6467 (elevation), both of which are unbelievable and make the other entries suspicious.

My answer:

Values in this field are based on what is recorded as height by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off, which can vary from project to project.

@sarahcd @tucotuco what would you suggest?

Order of data

@sarahcd the row order is strange, e.g., below. If I download the csv from Movebank, the records are not in this order, so that shouldn't be the reason.

[screenshot of the unexpected row order omitted]

@sarahcd can you clarify what row order you would expect? First by animal_tag, then timestamp?

Define how to populate metadata

Hi @sarahcd, here's a first attempt at a Movebank dataset on GBIF: https://www.gbif-uat.org/dataset/0ef15f32-b41d-4274-ae96-eb5d0059fee6

Dataset that is included:

  • Basis metadata:
    • Title + [subsampled representation]
    • Language + type + update frequency + publishing org: to be set by user in IPT
    • License: set, but issue #19
    • Description: copied from source dataset, first paragraph added
    • Contact: provided by user or first creator
    • Creators: original dataset authors/creators
    • Metadata provider: same as contact
    • Funding sources: provided as separate paragraph in source dataset and thus copied to IPT
  • Geographic coverage: not set, not directly available in source dataset
  • Taxonomic coverage: could be derived from data, not sure if worth it?
  • Temporal coverage: could be derived from data, not sure if worth it?
  • Keywords: copied from source dataset
  • Associated parties: not set
  • Project data: not set
  • Sampling methods: not set
  • Citations
    • Resource citation: left to automatic one by GBIF
    • Bibliography: not set, could potentially be derived from relatedIdentifiers in DataCite, not sure if worth it? no
  • Collection data: not applicable
  • External links: website set to Movebank Study ID (as a link)
  • Additional metadata:
    • Pub date: set to source dataset but overwritten by IPT
    • Alternative identifier: Movebank Study ID (as a link)
    • Alternative identifier: DOI, but see #15

Populate `minimumHeightAboveSurfaceInMeters` from different fields

Comment by @sarahcd:

gps."height-above-ellipsoid" AS minimumDistanceAboveSurfaceInMeters
Some datasets have 'height-above-ellipsoid'. The datum is different but both fit the definition of minimumDistanceAboveSurfaceInMeters.
Is it possible to map conditionally depending on which attribute is present?
Rarely, both attributes are present. And there is also 'height-raw', but this does not define units so should be ignored.
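A conditional mapping could be sketched with dplyr::coalesce(), which takes the first non-missing value per record. The column names below follow Movebank conventions but are assumptions about the parsed data frame:

```r
# Sketch: prefer "height-above-msl", fall back to "height-above-ellipsoid"
# when only that attribute is present (column names assumed).
library(dplyr)

gps <- gps %>%
  mutate(
    minimumDistanceAboveSurfaceInMeters = coalesce(
      .data[["height-above-msl"]],
      .data[["height-above-ellipsoid"]]
    )
  )
```

When both attributes are present, this silently prefers the first one; if the datum difference matters, an explicit dataGeneralizations note may be warranted.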

Add alternative labels for `magnetic field raw x` and alike

Provide alternative labels for (official first, alternative second):

Original comment:

These fields are only defined without the mag part in the Movebank attribute dictionary (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000151/), in contrast with e.g.

Exclude empty coordinates

@sarahcd is it possible for the coordinates of the gps records to be NULL or will those be marked as visible=false?

Important point, thanks for catching this! It is possible to have null (empty) coordinate fields that are not marked as outliers. (Sometimes there are other sensor measurements in the record, e.g., battery charge, that the user might want to retain in some contexts, so we don't automatically flag these.) Ideally this would be a preprocessing step, removing these before subsampling records.

Originally posted by @sarahcd in #7 (comment)
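As a preprocessing step, dropping records with empty coordinates before subsampling could look like this sketch (Movebank's "location-long"/"location-lat" column names are assumed):

```r
# Sketch: remove records without coordinates before subsampling,
# so they don't end up as empty occurrences (column names assumed).
library(dplyr)

gps <- gps %>%
  filter(
    !is.na(.data[["location-long"]]),
    !is.na(.data[["location-lat"]])
  )
```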

Add `study_id` parameter to `write_dwc()`

Indeed, the study URL is pulled from the second alternativeIdentifier in the EML:

study_url <- eml$dataset$alternateIdentifier[[2]]

To allow users to define it, we could:

  • Add a parameter study_id (from which the link can be built)
  • Have users add/correct this when they edit metadata in the IPT

Given that the link is used in the description and as external link, I think a parameter would be good?

Originally posted by @peterdesmet in #25 (comment)
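Building the link from a study_id parameter could be as simple as the sketch below, using the URL pattern of Movebank study pages (the example id is taken from the REST API example earlier in this page):

```r
# Sketch: build a Movebank study URL from a study_id parameter.
study_id <- "2911040"  # example id, for illustration only
study_url <- paste0(
  "https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study",
  study_id
)
```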

Publish first production dataset

@sarahcd @timrobertson100 I now have an operational function to create Darwin Core from a Movebank dataset: https://inbo.github.io/movepub/reference/write_dwc.html

I have used it to create:

I now want to use this to create the first production dataset, but would like some feedback on:

  • Is there sufficient metadata? See #9
  • Are we ok with the structure of the data? See also last remark at #7
  • What DOI to assign: #15

Make clickable aphia_id in message

write_dwc() currently returns the found aphia_ids:

library(frictionless)
library(movepub)
p <- read_package("https://zenodo.org/records/10055071/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10055071
dwc <- write_dwc(p, directory = ".")
#> 
#> ── Reading data ──
#> 
#> ℹ Taxa found in reference data and their WoRMS AphiaID:
#> Circus cyaneus: 159372
#> Asio flammeus: 212685
#> Circus pygargus: 1037307
#> Circus aeruginosus: 558541
#> Buteo buteo: 558534
#> 
#> ── Transforming data to Darwin Core ──
#> 
#> ── Writing files ──
#> 
#> • 'eml_path'
#> • 'dwc_occurrence_path'

Created on 2023-12-22 with reprex v2.0.2

It would be nice if those aphia_ids were clickable links, which can be done with:

library(cli)
library(movepub)
taxon <- get_aphia_id("Circus cyaneus")
cli::cli_li("{taxon[['name']]}: {.href [{taxon['aphia_id']}]({taxon['aphia_url']})}.")
#> • Circus cyaneus: 159372
#> (<https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372>).

Created on 2023-12-22 with reprex v2.0.2

But it takes some handling of NA and multiple values that I haven't been able to debug.
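A possible way around that is to loop over the rows and branch on NA, assuming get_aphia_id() returns a data frame with name, aphia_id and aphia_url columns (as the snippet above suggests):

```r
# Sketch: NA-safe clickable AphiaID links. Assumes get_aphia_id() returns a
# data frame with columns name, aphia_id and aphia_url (one row per taxon).
library(cli)
library(movepub)

taxa <- get_aphia_id(c("Circus cyaneus", "not_a_name"))
for (i in seq_len(nrow(taxa))) {
  if (is.na(taxa$aphia_id[i])) {
    cli_li("{taxa$name[i]}: no AphiaID found")
  } else {
    cli_li("{taxa$name[i]}: {.href [{taxa$aphia_id[i]}]({taxa$aphia_url[i]})}")
  }
}
```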

Add intro paragraph to dataset

Add an intro paragraph to the dataset:

This animal tracking dataset is derived from Spanoghe et al. (2022, https://doi.org/10.5281/zenodo.5879096), a deposit of Movebank study ID 1099562810. Data have been standardized to Darwin Core using the movepub R package and are downsampled to the first GPS position per hour. The original dataset description follows.

Note: The write_movebank_dwca() function is the perfect place to edit the EML and write it to the directory too.

Number of decimals in coordinates

Reviewer of data:

There is also minimal correlation between coordinate precision and cUIM. Four entries with 3 decimal places in
decimalLatitude have cUIM 2.8, 3.5, 4 and 5.9, although the precision in latitude for 3 decimal places is ±55.5 m
(https://en.wikipedia.org/wiki/Decimal_degrees#Precision).
(There are 256 decimalLatitude entries with 15 decimal places, e.g. "53.165282500000004". These have an implied
uncertainty of ±0.055 nanometers, or roughly half an atomic radius, and should have been rounded.)

My answer:

The decimal places of the coordinates are as is, i.e. how these are recorded in the source database and/or recorded by Movebank. Unfortunately that does include false precision (e.g. trailing 000 being dropped or more decimals than reasonable), but we prefer to provide the raw value (as available in the source record), rather than rounding/extending zeros.

@sarahcd @tucotuco what would you suggest? Should we round/extend?

Grouping events

The setup suggested in #7 has a grouping event (deployment) and occurrences (1 tag attachment, 0 or more GPS positions). In line with the new model, @timrobertson100 suggests having a separate event for every occurrence (basically eventID = occurrenceID) and grouping all GPS positions under the deployment using parentEventID. The deployment itself also has an occurrence attached: the tag attachment. See table:

occurrenceID | eventID   | parentEventID | basisOfRecord | eventRemarks
------------ | --------- | ------------- | ------------- | ------------------
ani1_tag1    | ani1_tag1 |               | HumanObs      | deployment remarks
occ1         | occ1      | ani1_tag1     | MachineObs    |
occ2         | occ2      | ani1_tag1     | MachineObs    |

@sarahcd I'm trying this approach to see how it looks at GBIF. Given that the capture and deployment event are one and the same in this model, I would name it tag deployment rather than tag attachment.

Records outside of deployment

@sarahcd, in 2017 you remarked:

Published datasets in Movebank often include records (rows in the data file) that do not represent a location where an animal was observed:

Pre- or post-deployment records. For these 'individual-local-identifier' is blank. This will only occur rarely; I think none so far include them but we have a dataset in review with data from some undeployed test tags.

I assume all published datasets exclude points outside a deployment? Or should we filter on those in the SQL?

What reference data to include

@sarahcd so far I have based the mapping for the start and end HumanObservations on fields I have in my datasets. It would be good however to have a mapping based on all fields potentially available in the reference data. Can you check the fields where you agree that they are not useful to map (crossed out) and comment on those you think would be useful to include?

  • animal-comments: start occurrenceRemarks?
  • animal-death-comments: end
  • animal-exact-date-of-birth
  • animal-id: organismID
  • animal-latest-date-born
  • animal-life-stage: lifeStage
  • animal-mass
  • animal-nickname: organismName
  • animal-reproductive-condition: reproductiveCondition
  • animal-ring-id: include, where?
  • animal-sex: sex
  • animal-taxon: scientificName
  • animal-taxon-detail
  • attachment-type: included in start eventRemarks
  • behavior-according-to
  • data-processing-software
  • deploy-off-date: end
  • deploy-off-latitude: end
  • deploy-off-longitude: end
  • deploy-off-person: end
  • deploy-on-date: start eventDate
  • deploy-on-latitude: start decimalLatitude
  • deploy-on-longitude: start decimalLongitude
  • deploy-on-person: start recordedBy?
  • deployment-comments: included in start eventRemarks
  • deployment-end-comments: end
  • deployment-end-type: end
  • deployment-id
  • duty-cycle
  • geolocator-calibration
  • geolocator-light-threshold
  • geolocator-sensor-comments
  • geolocator-sun-elevation-angle
  • habitat-according-to
  • location-accuracy-comments
  • manipulation-comments
  • manipulation-type: included in start eventRemarks
  • study-site
  • tag-beacon-frequency
  • tag-comments
  • tag-failure-comments
  • tag-id: part of eventID
  • tag-manufacturer-name: included in start eventRemarks
  • tag-mass
  • tag-model: included in start eventRemarks
  • tag-processing-type
  • tag-production-date
  • tag-readout-method
  • tag-serial-no

Movebank study ids are not 32 bit integers

In write_dwc() there is an assertion:

assertthat::assert_that(
  !is.na(as.integer(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)

This is an error because Movebank study ids are not necessarily 32-bit integers, e.g.

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study2391441038

but 2391441038 is not a 32 bit integer.

Could possibly replace with

assertthat::assert_that(
  regexpr("^\\d+$", as.character(study_id)) == 1,
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)
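An equivalent check with grepl() reads slightly more idiomatically (a sketch of the same idea, not a tested replacement):

```r
# Sketch: string-based integer check that avoids 32-bit overflow.
assertthat::assert_that(
  grepl("^[0-9]+$", as.character(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)
```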

Create function to write EML

See https://docs.ropensci.org/EML/articles/creating-EML.html

  • alternateIdentifier: see how GBIF handles this #15
  • title: add suffix
  • creator
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email: don't have this info
    • ORCID: directory not set, see #16
  • metadataProvider
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID
  • pubDate
  • language: not set
  • abstract
    • para(s): HTML correctly escaped
  • keywordSet
    • keyword(s)
  • intellectualRights: not read correctly if URL is provided
  • distribution$online$url
  • contact
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID

Add tests for `get_aphia_id()`

The helper function get_aphia_id() is currently 1) hidden and 2) untested.

  • It might be good to expose it as a public function: done in 02333a8
  • Add tests, e.g. for:
    • get_aphia_id("Mola mola") Should return 1 record
    • get_aphia_id(c("Mola mola", "not_a_name")) Should return 2 records, one NA
    • get_aphia_id(c("Mola mola", "?")) ? is handled differently by wm_name2id_(), but should return NA
    • get_aphia_id("Pisces") Unaccepted taxa, should return as normal
    • get_aphia_id(c("not_a_name")) Should return NA, but currently has an error

Very imprecise coordinateUncertaintyInMeters

Reviewer of data:

65 of the entries in the O_AMELAND dataset are "999998.9", or 1000 km, and indicate that the coordinates for these
entries cannot be trusted. If that cUIM is plausible, those 65 records should be discarded.

My answer:

Before records are uploaded to Movebank (from which they are derived for GBIF) we calculate whether or not they constitute an outlier based on speed and angle (https://github.com/inbo/bird-tracking/blob/master/src/outliers.Rmd#L55), not on (potentially imprecise) height and precision values. We exclude outliers when publishing to GBIF; we prefer not to exclude records based on the precision.

@sarahcd @tucotuco what would you suggest?

Empty fields

Reviewer of data:

The organismName and reproductiveCondition fields are empty and should be deleted.

My answer:

We developed a standardized approach for transforming data from a Movebank study to Darwin Core data (https://inbo.github.io/movepub/reference/write_dwc.html). This approach includes providing fields that might not be available/populated in the source data, such as organismName and reproductiveCondition. We could check and remove empty fields manually for each dataset, but we prefer to use the same (automated) approach for all datasets.

@sarahcd @tucotuco what would you suggest?

Handle missing fields

E.g. animal-reproductive-condition can be mapped, but is not in the source data we have. The function should be able to handle that.

Potential fields:

  • ref.animal-reproductive-condition AS reproductiveCondition
  • gps.comments AS eventRemarks
  • ref.tag-model in eventRemarks

Terms like `deploy-on measurements` written with hyphen in NERC

@sarahcd I noticed that the following (new) terms have labels that are written with hyphen:

This in contrast with previous terms:

The extra hyphen makes it annoying to find the term, since we ignore all -, _, and . to look up a term. Is it possible to change the labels to make them consistent?

See how IPT handles `userId` in EML

Expected:

<userId directory="http://orcid.org/">0000-0002-8442-8025</userId>

Current output:

<userId>https://orcid.org/0000-0002-8442-8025</userId>

Write tests for `write_dwc()`

Similar to camtraptor, this package could benefit from snapshot tests for write_dwc(). In contrast with camtraptor, write_dwc() writes both an eml.xml and dwc_occurrence.csv file. And there is no example dataset included in the package. You can use this dataset though:

o_assen <-
  frictionless::read_package("https://zenodo.org/record/5653311/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  # write_package() requires a directory to write to
  frictionless::write_package("o_assen")

It can potentially be reduced to fewer animals.

  • Write helper function to download example dataset (so it can be used locally)
    • Potentially reduce reference-data and gps data to a set of 3-4 animal-ids to reduce size
    • Downloaded files should be removed after tests
    • If the function is called directly from tests (rather than assigned to a variable at the beginning of the tests), it should check if the local files are available?
  • Test expected Darwin Core mapping
  • Test expected EML file (note that there is a UUID that is expected to change)

@PietrH would you be willing to tackle this?

Sampling protocol

@sarahcd what sampling protocol to use:

  • Deployment start: tag deployment start?
  • GPS positions: gps or gps sensor?
  • Deployment end: tag deployment end?

Split `write_dwc()` into `write_dwc()` and `write_eml()`

Ideally, write_dwc() can be used in an operational pipeline and doesn't generate errors. Currently, that is not the case, since it requires e.g. a rightsholder, contact person, etc. Those properties are actually required for the EML part, not for the Darwin Core transformation.

Get data type from Movebank Attribute Dictionary

In a future update of the Movebank Attribute Dictionary, data type will be indicated in the definition. Rather than letting create_schema() guess the types, it would be better if those are retrieved from the MVB Dictionary.

Provide copy of `read_package()` and `write_package()` from frictionless

Currently, users of movepub have to load the frictionless package to work with datasets (see https://inbo.github.io/movepub/articles/movepub.html).

It might be useful if the functions read_package() and write_package() are copied from frictionless and exposed through movepub, cf. how readr::problems() is exposed in camtraptor. That also avoids the name clash warning between frictionless::add_resource() and movepub::add_resource().
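In movepub's R/ code, the roxygen2 re-export pattern could look like this sketch:

```r
# Sketch: re-export frictionless functions through movepub,
# using the standard roxygen2 re-export pattern.

#' @importFrom frictionless read_package
#' @export
frictionless::read_package

#' @importFrom frictionless write_package
#' @export
frictionless::write_package
```

This documents the functions as re-exports on movepub's reference page while the implementation stays in frictionless.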

DwC mapping of Movebank

Record-level

See #12

Deployment

  • basisOfRecord
  • occurrenceID: don't use deploymentID (not always assigned), use same value as dwc:eventID
  • individualCount: add? no
  • sex: unknown -> undetermined? Left as unknown
  • lifeStage: map as is? yes
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: use ref."animal-id" || '_' || ref."tag-id" (different order, not a hyphen)
  • samplingProtocol: tag deployment? use tag attachment
  • eventDate
  • eventRemarks: yes, use this since it will show on GBIF. Also include more information:
ref."tag-manufacturer-name" || ' ' || ref."tag-model" || ' tag attached to ' ||
CASE
  WHEN ref."manipulation-type" = 'none' THEN 'free-ranging animal'
  WHEN ref."manipulation-type" != 'none' THEN 'manipulated animal (see original dataset)' -- Or exclude these animals from GBIF.
END
|| ' by ' || ref."attachment-type"
  • minimumDistanceAboveSurfaceInMeters: NULL
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum Make conditional on decimalLatitude
  • coordinateUncertaintyInMeters: 30? Use 1000, make conditional on decimalLatitude
  • taxonID: available? Yes, in the database (ITIS TSN), but it would have to be retrieved through https://www.movebank.org/movebank/service/direct-read?entity_type=taxon&canonical_name=Phoebastria%20irrorata. However, since scientificNames are already standardized, it's not really worth it
  • scientificName
  • kingdom: set to Animalia yes
  • rank: available? almost always species, but subspecies could happen too

GPS

  • basisOfRecord
  • dataGeneralizations: e.g. 'subsampled by hour: first of x records'? ok
  • occurrenceID: prefix with something? movebank:00000 use identifier as is
  • sex: NULL or map? Now mapped
  • lifeStage: NULL, see #11
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: update
  • samplingProtocol: gps."sensor-type"? yes
  • eventDate
  • eventRemarks: NULL
  • minimumDistanceAboveSurfaceInMeters: height-above-msl
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum
  • coordinateUncertaintyInMeters
  • scientificName
  • kingdom: set to Animalia yes

Include `scientificNameID` to make datasets valid for OBIS

This would require a call to WoRMS, with the worrms package. I see no function to get the LSID directly, so it would be something like:

aphia_id <- worrms::wm_name2id(name = "Larus fuscus")
# Keep scientificNameID empty if no AphiaID was found
scientific_name_id <- ifelse(
  is.na(aphia_id),
  NA_character_,
  paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":")
)

Very precise coordinateUncertaintyInMeters

Reviewer of the data:

Please consider recalculating or re-estimating the entries in the coordinateUncertaintyInMeters field. I would have
expected that a default uncertainty could be provided for a GPS tracking unit (or a handheld GPS). Instead, the entries
are hugely varied and given to 0.1 m, which (again) is unbelievable.

My answer:

Values in this field are based on what is recorded as horizontal precision by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off (e.g. the default 30m as lower value for GPS).

@sarahcd @tucotuco what would you suggest?

License not recognized by GBIF

See gbif/ipt#1778

This is recognized by the IPT editor, but not by other systems:

<intellectualRights>
      <para>Public Domain (CC0 1.0)</para>
      <rights>Creative Commons Zero v1.0 Universal</rights>
      <rightsIdentifier>cc0-1.0</rightsIdentifier>
      <rightsIdentifierScheme>SPDX</rightsIdentifierScheme>
      <rightsUri>https://creativecommons.org/publicdomain/zero/1.0/legalcode</rightsUri>
      <schemeUri>https://spdx.org/licenses/</schemeUri>
</intellectualRights>
