inbo / movepub

R package to prepare animal tracking data from Movebank for publication in a research repository or GBIF

Home Page: https://inbo.github.io/movepub/

License: Other

Language: R (100%)
Topics: oscibio, r, r-package, rstats

movepub's Issues

Provide copy of `read_package()` and `write_package()` from frictionless

Currently, users of movepub have to load the frictionless package to work with datasets (see https://inbo.github.io/movepub/articles/movepub.html).

It might be useful to re-export the functions read_package() and write_package() from frictionless through movepub, cf. how readr::problems() is exposed in camtraptor. That would also avoid the name clash warning between frictionless::add_resource() and movepub::add_resource().
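A minimal sketch of such a re-export, following the standard roxygen2 pattern (the file name and placement are assumptions, not existing movepub code):

```r
# R/reexports.R (hypothetical file): re-export frictionless functions
# so movepub users don't need to attach frictionless themselves.

#' @importFrom frictionless read_package
#' @export
frictionless::read_package

#' @importFrom frictionless write_package
#' @export
frictionless::write_package
```

After documenting, both functions would be callable as movepub::read_package() and movepub::write_package().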

Add intro paragraph to dataset

Add an intro paragraph to the dataset:

This animal tracking dataset is derived from Spanoghe et al. (2022, https://doi.org/10.5281/zenodo.5879096), a deposit of Movebank study ID 1099562810. Data have been standardized to Darwin Core using the movepub R package and are downsampled to the first GPS position per hour. The original dataset description follows.

Note: the write_movebank_dwca() function is the perfect place to edit the EML and write it to the directory too.

Use dplyr for write_dwc()

write_dwc() currently uses SQLite for the transformation. This was chosen so that the transformation itself could be written in SQL, which is more universally understood. Those SQL files are referenced in the function documentation: https://inbo.github.io/movepub/reference/write_dwc.html#data

Annoyingly, this adds two dependencies (RSQLite, DBI), and it is currently affected by a change in glue: #59. Just like in camtraptor/camtrapdp (see inbo/camtraptor#207), I suggest moving the transformation from SQL to dplyr::mutate(). dplyr is already a dependency.
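As a rough illustration of the suggested move, a SQL `SELECT ... AS` mapping becomes a dplyr::mutate() call. The column names follow Movebank conventions, but this simplified mapping is an assumption, not the full write_dwc() transformation:

```r
library(dplyr)

# Hypothetical GPS records with Movebank-style column names
gps <- data.frame(
  `individual-local-identifier` = "ani1",
  `location-lat` = 51.1,
  `location-long` = 3.5,
  check.names = FALSE
)

# SQL like: SELECT gps."location-lat" AS decimalLatitude, ...
# becomes:
occurrence <- gps %>%
  mutate(
    decimalLatitude = `location-lat`,
    decimalLongitude = `location-long`,
    geodeticDatum = "EPSG:4326"
  )
```

This keeps the mapping readable in the R source itself, without the RSQLite/DBI dependencies.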

Todo:

Add alternative labels for `magnetic field raw x` and alike

Provide alternative labels for (official first, alternative second):

Original comment:

These fields are only defined without the mag part in the Movebank attribute dictionary (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000151/), in contrast with e.g.

Get data type from Movebank Attribute Dictionary

In a future update of the Movebank Attribute Dictionary, the data type will be indicated in the definition. Rather than letting create_schema() guess the types, it would be better to retrieve them from the Movebank Attribute Dictionary.

Populate `minimumHeightAboveSurfaceInMeters` from different fields

Comment by @sarahcd:

gps."height-above-ellipsoid" AS minimumDistanceAboveSurfaceInMeters

Some datasets have 'height-above-ellipsoid'. The datum is different, but both fit the definition of minimumDistanceAboveSurfaceInMeters. Is it possible to map conditionally, depending on which attribute is present? Rarely, both attributes are present. There is also 'height-raw', but this does not define units and so should be ignored.
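A conditional mapping along these lines could prefer one attribute and fall back to the other. A base R sketch (which columns are present per dataset, and the preference order, are assumptions):

```r
gps <- data.frame(
  `height-above-msl` = c(12.3, NA),
  `height-above-ellipsoid` = c(NA, 55.1),
  check.names = FALSE
)

# Prefer height-above-msl, fall back to height-above-ellipsoid
gps$minimumDistanceAboveSurfaceInMeters <- ifelse(
  !is.na(gps$`height-above-msl`),
  gps$`height-above-msl`,
  gps$`height-above-ellipsoid`
)
```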

Very imprecise coordinateUncertaintyInMeters

Reviewer of data:

65 of the entries in the O_AMELAND dataset are "999998.9", or 1000 km, and indicate that the coordinates for these entries cannot be trusted. If that cUIM is plausible, those 65 records should be discarded.

My answer:

Before records are uploaded to Movebank (from which they are derived for GBIF), we calculate whether or not they constitute an outlier based on speed and angle (https://github.com/inbo/bird-tracking/blob/master/src/outliers.Rmd#L55), not on (potentially imprecise) height and precision values. We exclude outliers when publishing to GBIF, but we prefer not to exclude records based on the precision.

@sarahcd @tucotuco what would you suggest?

Grouping events

The setup suggested in #7 has a grouping event (deployment) and occurrences (1 tag attachment, 0 or more GPS positions). In line with the new model, @timrobertson100 suggests having a separate event for each occurrence (basically eventID = occurrenceID) and grouping all GPS positions under the deployment using parentEventID. The deployment itself also has an occurrence attached: the tag attachment. See table:

occurrenceID | eventID   | parentEventID | basisOfRecord | eventRemarks
------------ | --------- | ------------- | ------------- | ------------------
ani1_tag1    | ani1_tag1 |               | HumanObs      | deployment remarks
occ1         | occ1      | ani1_tag1     | MachineObs    |
occ2         | occ2      | ani1_tag1     | MachineObs    |

@sarahcd I'm trying this approach to see how it looks at GBIF. Given that the capture and deployment event are one and the same in this model, I would name it tag deployment rather than tag attachment.

Terms like `deploy-on measurements` written with hyphen in NERC

@sarahcd I noticed that the following (new) terms have labels that are written with hyphen:

This in contrast with previous terms:

The extra hyphen makes it annoying to find the term, since we ignore all -, _, and . characters when looking up a term. Would it be possible to change the labels to make them consistent?
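The lookup normalization described above can be sketched as follows (normalize_label() is a hypothetical name, not a movepub function):

```r
# Ignore -, _ and . (and case) when matching a field name to a term label
normalize_label <- function(x) {
  gsub("[-_.]", "", tolower(x))
}

normalize_label("magnetic-field-raw-x") == normalize_label("magnetic_field_raw_x")
# TRUE: both normalize to "magneticfieldrawx"
```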

License not recognized by GBIF

See gbif/ipt#1778

This is recognized by the IPT editor, but not by other systems:

<intellectualRights>
      <para>Public Domain (CC0 1.0)</para>
      <rights>Creative Commons Zero v1.0 Universal</rights>
      <rightsIdentifier>cc0-1.0</rightsIdentifier>
      <rightsIdentifierScheme>SPDX</rightsIdentifierScheme>
      <rightsUri>https://creativecommons.org/publicdomain/zero/1.0/legalcode</rightsUri>
      <schemeUri>https://spdx.org/licenses/</schemeUri>
</intellectualRights>

Handle missing fields

E.g. animal-reproductive-condition can be mapped, but is not present in the source data we have. The function should be able to handle that.

Potential fields:

  • ref.animal-reproductive-condition AS reproductiveCondition
  • gps.comments AS eventRemarks
  • ref.tag-model in eventRemarks
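One way to make the mapping robust to absent columns is a small helper that returns NA when a column is missing (get_col() is a hypothetical name, not part of movepub):

```r
# Return a column if present, otherwise NA, so downstream Darwin Core
# terms like reproductiveCondition are simply left empty
get_col <- function(df, name) {
  if (name %in% names(df)) df[[name]] else NA
}

ref <- data.frame(`animal-id` = "ani1", check.names = FALSE)
get_col(ref, "animal-reproductive-condition")  # NA: column absent
get_col(ref, "animal-id")                      # "ani1"
```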

Exclude empty coordinates

@sarahcd is it possible for the coordinates of the gps records to be NULL or will those be marked as visible=false?

Important point, thanks for catching this! It is possible to have null (empty) coordinate fields that are not marked as outliers. (Sometimes there are other sensor measurements in the record, e.g., battery charge, that the user might want to retain in some contexts, so we don't automatically flag these.) Ideally this would be a preprocessing step, removing these before subsampling records.

Originally posted by @sarahcd in #7 (comment)
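Such a preprocessing step could be a simple filter, sketched here in base R (the column names assume the Movebank GPS data format):

```r
gps <- data.frame(
  `location-lat` = c(51.1, NA),
  `location-long` = c(3.5, NA),
  `battery-charge-percent` = c(80, 79),
  check.names = FALSE
)

# Drop records without coordinates before subsampling
gps <- gps[!is.na(gps$`location-lat`) & !is.na(gps$`location-long`), ]
```

Records that only carry other sensor measurements (e.g. battery charge) would be removed here, which is why this should stay an explicit, opt-in step.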

Sampling protocol

@sarahcd what sampling protocol to use:

  • Deployment start: tag deployment start?
  • GPS positions: gps or gps sensor?
  • Deployment end: tag deployment end?

Records outside of deployment

@sarahcd, in 2017 you remarked:

Published datasets in Movebank often include records (rows in the data file) that do not represent a location where an animal was observed:

Pre- or post-deployment records. For these 'individual-local-identifier' is blank. This will only occur rarely; I think none so far include them but we have a dataset in review with data from some undeployed test tags.

I assume all published datasets exclude points outside a deployment? Or should we filter on those in the SQL?

See how IPT handles `userId` in EML

Expected:

<userId directory="http://orcid.org/">0000-0002-8442-8025</userId>

Current output:

<userId>https://orcid.org/0000-0002-8442-8025</userId>
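One way to produce the expected form is to split the ORCID URL into a `directory` attribute and a bare identifier before writing the EML. A sketch (this is not how movepub currently handles it):

```r
orcid_url <- "https://orcid.org/0000-0002-8442-8025"

# Strip the scheme and host to get the bare identifier;
# the directory attribute is the fixed ORCID resolver
user_id <- sub("^https?://orcid\\.org/", "", orcid_url)
directory <- "http://orcid.org/"
# Target: <userId directory="http://orcid.org/">0000-0002-8442-8025</userId>
```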

Number of decimals in coordinates

Reviewer of data:

There is also minimal correlation between coordinate precision and cUIM. Four entries with 3 decimal places in decimalLatitude have cUIM 2.8, 3.5, 4 and 5.9, although the precision in latitude for 3 decimal places is ±55.5 m (https://en.wikipedia.org/wiki/Decimal_degrees#Precision). (There are 256 decimalLatitude entries with 15 decimal places, e.g. "53.165282500000004". These have an implied uncertainty of ±0.055 nanometers, or roughly half an atomic radius, and should have been rounded.)

My answer:

The decimal places of the coordinates are kept as is, i.e. as recorded in the source database and/or by Movebank. Unfortunately that does include false precision (e.g. trailing zeros being dropped, or more decimals than reasonable), but we prefer to provide the raw value (as available in the source record) rather than rounding or padding with zeros.

@sarahcd @tucotuco what would you suggest? Should we round/extend?

Movebank study IDs are not 32-bit integers

In write_dwc() there is an assertion:

assertthat::assert_that(
    !is.na(as.integer(study_id)),
    msg = glue::glue("`study_id` ({study_id}) must be an integer.")
  )

This fails because Movebank study IDs are not necessarily 32-bit integers. E.g.

https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study2391441038

has study ID 2391441038, which does not fit in a 32-bit integer, so as.integer() returns NA with a warning.

This could be replaced with:

assertthat::assert_that(
  grepl("^\\d+$", as.character(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)

Add `study_id` parameter to `write_dwc()`

Indeed, the study URL is pulled from the second alternativeIdentifier in the EML:

study_url <- eml$dataset$alternateIdentifier[[2]]

To allow users to define it, we could:

  • Add a parameter study_id (from which the link can be built)
  • Have users correctly add this when they edit metadata in the IPT

Given that the link is used in the description and as an external link, I think a parameter would be good?

Originally posted by @peterdesmet in #25 (comment)

Make clickable aphia_id in message

write_dwc() currently returns the found aphia_ids:

library(frictionless)
library(movepub)
p <- read_package("https://zenodo.org/records/10055071/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10055071
dwc <- write_dwc(p, directory = ".")
#> 
#> ── Reading data ──
#> 
#> ℹ Taxa found in reference data and their WoRMS AphiaID:
#> Circus cyaneus: 159372
#> Asio flammeus: 212685
#> Circus pygargus: 1037307
#> Circus aeruginosus: 558541
#> Buteo buteo: 558534
#> 
#> ── Transforming data to Darwin Core ──
#> 
#> ── Writing files ──
#> 
#> • 'eml_path'
#> • 'dwc_occurrence_path'

Created on 2023-12-22 with reprex v2.0.2

It would be nice if those aphia_ids were clickable links, which can be done with:

library(cli)
library(movepub)
taxon <- get_aphia_id("Circus cyaneus")
cli::cli_li("{taxon[['name']]}: {.href [{taxon['aphia_id']}]({taxon['aphia_url']})}.")
#> • Circus cyaneus: 159372
#> (<https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372>).

Created on 2023-12-22 with reprex v2.0.2

But it takes some handling of NA and multiple values that I haven't been able to debug.
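A possible way to handle the NA and multiple-value cases is to build the cli markup per taxon first (format_taxon() is a hypothetical helper, not part of movepub):

```r
# Build the cli markup for one taxon, falling back to plain text
# when no AphiaID was found
format_taxon <- function(name, aphia_id, aphia_url) {
  if (is.na(aphia_id)) {
    sprintf("%s: not found", name)
  } else {
    sprintf("%s: {.href [%s](%s)}", name, aphia_id, aphia_url)
  }
}

# Each string can then be passed to cli::cli_li(), e.g.:
# cli::cli_li(format_taxon("Circus cyaneus", 159372,
#   "https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372"))
format_taxon("not_a_name", NA, NA)  # "not_a_name: not found"
```

Looping over the rows of the taxa data frame (rather than vectorizing the cli call) sidesteps the multiple-value issue.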

Create helper functions for `write_dwc()`

write_dwc() is currently too long and is better split into helper functions. This will also make it possible to call certain functions for certain data (GPS vs. bird ring, etc.).

Still in write_dwc()

  • Read resources as dataframes (we don't want to do this multiple times)
  • Get taxa from ref (we don't want to do this multiple times)
  • Binding the occurrence df from the helper functions
  • Adding the DATASET-LEVEL terms

`dwc_occurrence_ref()`

  • File: dwc_occurrence_ref.R
  • Document function, but use @noRd
  • Parameters: ref (dataframe), taxa (dataframe)
  • Applies expand_col() in function, but only with the columns that are used in the mapping
  • Applies dplyr mapping
  • Returns df

`dwc_occurrence_gps()`

  • File: dwc_occurrence_gps.R
  • Document function, but use @noRd
  • Parameters: ref (dataframe), taxa (dataframe), gps (dataframe)
  • Applies expand_col() in function, but only with the columns that are used in the mapping
  • Applies dplyr mapping
  • Returns df

Publish first production dataset

@sarahcd @timrobertson100 I now have an operational function to create Darwin Core from a Movebank dataset: https://inbo.github.io/movepub/reference/write_dwc.html

I have used it to create:

I now want to use this to create the first production dataset, but would like some feedback on:

  • Is there sufficient metadata? See #9
  • Are we ok with the structure of the data? See also last remark at #7
  • What DOI to assign: #15

Include `scientificNameID` to make datasets valid for OBIS

This would require a call to WoRMS, with the worrms package. I see no function to get the LSID directly, so it would be something like:

aphia_id <- worrms::wm_name2id(name = "Larus fuscus")
# Keep scientific_name_id empty if aphia_id is empty
scientific_name_id <- paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":")

Define how to populate metadata

Hi @sarahcd, here's a first attempt at a Movebank dataset on GBIF: https://www.gbif-uat.org/dataset/0ef15f32-b41d-4274-ae96-eb5d0059fee6

Dataset that is included:

  • Basis metadata:
    • Title + [subsampled representation]
    • Language + type + update frequency + publishing org: to be set by user in IPT
    • License: set, but issue #19
    • Description: copied from source dataset, first paragraph added
    • Contact: provided by user or first creator
    • Creators: original dataset authors/creators
    • Metadata provider: same as contact
    • Funding sources: provided as separate paragraph in source dataset and thus copied to IPT
  • Geographic coverage: not set, not directly available in source dataset
  • Taxonomic coverage: could be derived from data, not sure if worth it?
  • Temporal coverage: could be derived from data, not sure if worth it?
  • Keywords: copied from source dataset
  • Associated parties: not set
  • Project data: not set
  • Sampling methods: not set
  • Citations
    • Resource citation: left to automatic one by GBIF
    • Bibliography: not set, could potentially be derived from relatedIdentifiers in DataCite, not sure if worth it? no
  • Collection data: not applicable
  • External links: website set to Movebank Study ID (as a link)
  • Additional metadata:
    • Pub date: set to source dataset but overwritten by IPT
    • Alternative identifier: Movebank Study ID (as a link)
    • Alternative identifier: DOI, but see #15

Populate record level terms from DataCite

  • type
  • license

Could pull from DataCite, e.g. https://api.datacite.org/dois/application/vnd.datacite.datacite+xml/10.5441/001/1.vp4cf4qg:

<rightsList><rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Universal Public Domain Dedication (CC0 1.0)</rights>

Or Movebank REST API, e.g. https://www.movebank.org/movebank/service/direct-read?entity_type=study&study_id=2911040

license_type = CC_0
  • rightsHolder: INBO

Should typically be the institution that collected the data. Not currently represented in DataCite metadata from the repository. In Movebank, I plan to add a study-level attribute that combines institutionID and rightsHolder.

  • datasetID: https://doi.org/10.5281/zenodo.5879096

DOI as a URL, makes sense to always use a PURL and refer to the Movebank study elsewhere

  • institutionCode: MPIAB

institutionCode = 'custody' of the data. Decided to use 'Max Planck Institute of Animal Behavior'.

  • collectionCode: Movebank

Given example 'ebird', can use 'Movebank'.

  • datasetName: O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium) [subsampled representation]

Using a suffix to prevent systems like ORCID from considering the datasets duplicates

  • informationWithheld: see https://doi.org/10.5281/zenodo.5879096 or see metadata

Add tests for `get_aphia_id()`

The helper function get_aphia_id() is currently 1) hidden and 2) untested.

  • It might be good to expose it as a public function: done in 02333a8
  • Add tests, e.g. for:
    • get_aphia_id("Mola mola") Should return 1 record
    • get_aphia_id(c("Mola mola", "not_a_name")) Should return 2 records, one NA
    • get_aphia_id(c("Mola mola", "?")) ? is handled differently by wm_name2id_(), but should return NA
    • get_aphia_id("Pisces") Unaccepted taxa, should return as normal
    • get_aphia_id(c("not_a_name")) Should return NA, but currently has an error

HTML code is lost during conversion to EML

In the EML file, the second paragraph of the metadata does not contain HTML code, so when uploading to the IPT, the links and formatting are not preserved.

The question is: is it worth debugging this, or more time-efficient to do it manually?

Order of data

@sarahcd the row order is strange, e.g., below. If I download the csv from Movebank, the records are not in this order, so that shouldn't be the reason.

[screenshot: unexpected row order in the data]

@sarahcd can you clarify what row order you would expect? First by animal_tag, then timestamp?
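If sorting by animal/tag and then timestamp is indeed the expectation, it could be enforced as a final step. A base R sketch (column names and values are illustrative):

```r
occ <- data.frame(
  eventID = c("ani2_tag2", "ani1_tag1", "ani1_tag1"),
  eventDate = c(
    "2022-01-01T10:00:00Z", "2022-01-01T11:00:00Z", "2022-01-01T10:00:00Z"
  )
)

# Sort by animal/tag first, then by timestamp
occ <- occ[order(occ$eventID, occ$eventDate), ]
```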

Create function to write EML

See https://docs.ropensci.org/EML/articles/creating-EML.html

  • alternateIdentifier: see how GBIF handles this #15
  • title: add suffix
  • creator
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email: don't have this info
    • ORCID: directory not set, see #16
  • metadataProvider
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID
  • pubDate
  • language: not set
  • abstract
    • para(s): HTML correctly escaped
  • keywordSet
    • keyword(s)
  • intellectualRights: not read correctly if URL is provided
  • distribution$online$url
  • contact
    • givenName
    • surName
    • organizationName: not intended for affiliation
    • email
    • ORCID

Very precise coordinateUncertaintyInMeters

Reviewer of the data:

Please consider recalculating or re-estimating the entries in the coordinateUncertaintyInMeters field. I would have expected that a default uncertainty could be provided for a GPS tracking unit (or a handheld GPS). Instead, the entries are hugely varied and given to 0.1 m, which (again) is unbelievable.

My answer:

Values in this field are based on what is recorded as horizontal precision by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off (e.g. the default 30m as lower value for GPS).

@sarahcd @tucotuco what would you suggest?

Imprecise values in minimumDistanceAboveSurfaceInMeters

Reviewer of data:

Please consider deleting the minimumDistanceAboveSurfaceInMeters field. In the O_AMELAND dataset the entries range from -405 (depth) to 6467 (elevation), both of which are unbelievable and make the other entries suspicious.

My answer:

Values in this field are based on what is recorded as height by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off, which can vary from project to project.

@sarahcd @tucotuco what would you suggest?

DwC mapping of Movebank

Record-level

See #12

Deployment

  • basisOfRecord
  • occurrenceID: don't use deploymentID (not always assigned), use same value as dwc:eventID
  • individualCount: add? no
  • sex: unknown -> undetermined left as unknown
  • lifeStage: map as is? yes
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: use ref."animal-id" || '_' || ref."tag-id" (different order, not a hyphen)
  • samplingProtocol: tag deployment? use tag attachment
  • eventDate
  • eventRemarks: yes, use this since it will show on GBIF. Also include more information:
    ref."tag-manufacturer-name" || ' ' || ref."tag-model" || ' tag attached to ' ||
    CASE
      WHEN ref."manipulation-type" = 'none' THEN 'free-ranging animal'
      ELSE 'manipulated animal (see original dataset)' -- Or exclude these animals from GBIF.
    END
    || ' by ' || ref."attachment-type"
  • minimumDistanceAboveSurfaceInMeters: NULL
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum Make conditional on decimalLatitude
  • coordinateUncertaintyInMeters: 30? Use 1000, make conditional on decimalLatitude
  • taxonID: available? Yes, in database (ITIS TSN), but would have to be retrieved through https://www.movebank.org/movebank/service/direct-read?entity_type=taxon&canonical_name=Phoebastria%20irrorata). However, since scientificNames are already standardized, it's not really worth it
  • scientificName
  • kingdom: set to Animalia yes
  • rank: available? almost always species, but subspecies could happen too
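The eventID construction above, ported from the SQL expression to R (the sample values are hypothetical):

```r
ref <- data.frame(
  `animal-id` = "ani1",
  `tag-id` = "tag1",
  check.names = FALSE
)

# ref."animal-id" || '_' || ref."tag-id"
eventID <- paste(ref$`animal-id`, ref$`tag-id`, sep = "_")
eventID  # "ani1_tag1"
```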

GPS

  • basisOfRecord
  • dataGeneralizations: e.g. `subsampled by hour: first of x records`? ok
  • occurrenceID: prefix with something, e.g. movebank:00000? use identifier as is
  • sex: NULL or map? Now mapped
  • lifeStage: NULL, see #11
  • occurrenceStatus
  • organismID
  • organismName
  • eventID: update
  • samplingProtocol: gps."sensor-type"? yes
  • eventDate
  • eventRemarks: NULL
  • minimumDistanceAboveSurfaceInMeters: height-above-msl
  • decimalLatitude: round to 5? keep as is
  • decimalLongitude: round to 5? keep as is
  • geodeticDatum
  • coordinateUncertaintyInMeters
  • scientificName
  • kingdom: set to Animalia yes

Empty fields

Reviewer of data:

The organismName and reproductiveCondition fields are empty and should be deleted.

My answer:

We developed a standardized approach for transforming data from a Movebank study to Darwin Core data (https://inbo.github.io/movepub/reference/write_dwc.html). This approach includes providing fields that might not be available/populated in the source data, such as organismName and reproductiveCondition. We could check and remove empty fields manually for each dataset, but we prefer to use the same (automated) approach for all datasets.

@sarahcd @tucotuco what would you suggest?

Write tests for `write_dwc()`

Similar to camtraptor, this package could benefit from snapshot tests for write_dwc(). In contrast with camtraptor, write_dwc() writes both an eml.xml and a dwc_occurrence.csv file, and there is no example dataset included in the package. You can use this dataset though:

o_assen <-
  frictionless::read_package("https://zenodo.org/record/5653311/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  # write_package() requires a directory to write to
  frictionless::write_package(directory = ".")

It can potentially be reduced to fewer animals.

  • Write helper function to download example dataset (so it can be used locally)
    • Potentially reduce reference-data and gps data to a set of 3-4 animal-ids to reduce size
    • Downloaded files should be removed after tests
    • If the function is called directly from the tests (rather than assigned to a variable at the beginning of the tests), it should check if the local files are available?
  • Test expected Darwin Core mapping
  • Test expected EML file (note that there is a UUID that is expected to change)

@PietrH would you be willing to tackle this?

What reference data to include

@sarahcd so far I have based the mapping for the start and end HumanObservations on fields I have in my datasets. It would be good, however, to have a mapping based on all fields potentially available in the reference data. Can you check the fields, confirm those that are not useful to map (crossed out), and comment on those you think would be useful to include?

  • animal-comments: start occurrenceRemarks?
  • animal-death-comments: end
  • animal-exact-date-of-birth
  • animal-id: organismID
  • animal-latest-date-born
  • animal-life-stage: lifeStage
  • animal-mass
  • animal-nickname: organismName
  • animal-reproductive-condition: reproductiveCondition
  • animal-ring-id: include, where?
  • animal-sex: sex
  • animal-taxon: scientificName
  • animal-taxon-detail
  • attachment-type: included in start eventRemarks
  • behavior-according-to
  • data-processing-software
  • deploy-off-date: end
  • deploy-off-latitude: end
  • deploy-off-longitude: end
  • deploy-off-person: end
  • deploy-on-date: start eventDate
  • deploy-on-latitude: start decimalLatitude
  • deploy-on-longitude: start decimalLongitude
  • deploy-on-person: start recordedBy?
  • deployment-comments: included in start eventRemarks
  • deployment-end-comments: end
  • deployment-end-type: end
  • deployment-id
  • duty-cycle
  • geolocator-calibration
  • geolocator-light-threshold
  • geolocator-sensor-comments
  • geolocator-sun-elevation-angle
  • habitat-according-to
  • location-accuracy-comments
  • manipulation-comments
  • manipulation-type: included in start eventRemarks
  • study-site
  • tag-beacon-frequency
  • tag-comments
  • tag-failure-comments
  • tag-id: part of eventID
  • tag-manufacturer-name: included in start eventRemarks
  • tag-mass
  • tag-model: included in start eventRemarks
  • tag-processing-type
  • tag-production-date
  • tag-readout-method
  • tag-serial-no
