inbo/movepub
R package to prepare animal tracking data from Movebank for publication in a research repository or GBIF
Home Page: https://inbo.github.io/movepub/
License: Other
Currently, users of movepub have to load the frictionless package to work with datasets (see https://inbo.github.io/movepub/articles/movepub.html). It might be useful if the functions `read_package()` and `write_package()` were copied from frictionless and exposed through movepub, cf. how `readr::problems()` is exposed in camtraptor. That would also avoid the name clash warning between `frictionless::add_resource()` and `movepub::add_resource()`.
Add an intro paragraph to the dataset:
This animal tracking dataset is derived from Spanoghe et al. (2022, https://doi.org/10.5281/zenodo.5879096), a deposit of Movebank study ID 1099562810. Data have been standardized to Darwin Core using the movepub R package and are downsampled to the first GPS position per hour. The original dataset description follows.
Note: the `write_movebank_dwca()` function is the perfect place to edit the EML and write it to the directory too.
http://vocab.nerc.ac.uk/collection/MVB/current/MVB000055/ currently has the altLabel `bar:barometric pressure`. That makes it harder to parse with `get_movebank_term()`. @sarahcd will update this in a next release to: `bar barometric pressure`.
write_dwc() currently uses sqlite for the transformation. This was chosen so that the transformation itself could be written in SQL, which is more universally understood. Those SQL files are referenced in the function documentation: https://inbo.github.io/movepub/reference/write_dwc.html#data
Annoyingly, this adds two dependencies (RSQLite, DBI) and it is currently affected by a change in glue: #59. Just like in camtraptor/camtrapdp (see inbo/camtraptor#207), I suggest moving the transformation from SQL to `dplyr::mutate()`. dplyr is already a dependency.
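The suggested move from SQL to dplyr could look like this minimal sketch. The column names follow the Movebank GPS data model, but the fragment is illustrative, not the actual `write_dwc()` mapping:

```r
library(dplyr)

# Hypothetical fragment of the Darwin Core mapping with dplyr::mutate()
# instead of SQL. Input and output columns are illustrative.
gps <- data.frame(
  `individual-local-identifier` = c("H123456", "H123456"),
  `location-lat` = c(51.1, 51.2),
  `location-long` = c(3.5, 3.6),
  check.names = FALSE
)

dwc_occurrence <- gps %>%
  mutate(
    occurrenceID = paste(`individual-local-identifier`, row_number(), sep = "_"),
    basisOfRecord = "MachineObservation",
    decimalLatitude = `location-lat`,
    decimalLongitude = `location-long`,
    geodeticDatum = "EPSG:4326",
    .keep = "none" # keep only the Darwin Core columns created here
  )
```

Expressing the mapping this way would drop the RSQLite and DBI dependencies and sidestep the glue_sql() issue entirely.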
Todo: use `dplyr::mutate()` in `write_dwc()` and `write_eml()`, see #57.

Provide alternative labels for (official first, alternative second):
- magnetic field raw x: mag magnetic field raw x
- magnetic field raw y: mag magnetic field raw y
- magnetic field raw z: mag magnetic field raw z
- light level: gls light level
- Ornitela transmission protocol: orn transmission protocol
- study time zone: study timezone is now the official term (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000177/)
- local timestamp: study local timestamp is not the official term (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000140/)

Original comment:
These fields are only defined without the `mag` part in the Movebank attribute dictionary (http://vocab.nerc.ac.uk/collection/MVB/current/MVB000151/), in contrast with e.g.:

- gps:hdop: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000118/
- eobs:key-bin-checksum: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000095/
- bar:barometric-pressure: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000055/ (alt label)

Ideally, `write_dwc()` can be used in an operational pipeline and doesn't generate errors. Currently, that is not the case, since it requires e.g. a rightsholder, contact person, etc. Those properties are actually required for the EML part, not the Darwin Core transformations.
... One additional request is for the messages shown as the code is executed. Where messages say 'Please make sure you have the right to access data....', please somewhere refer to 'Movebank’s terms of use at https://www.movebank.org/cms/movebank-content/general-movebank-terms-of-use'. If you point me to the file/s with the message text I can suggest it there directly.
In a future update of the Movebank Attribute Dictionary, the data type will be indicated in the definition. Rather than letting `create_schema()` guess the types, it would be better if those were retrieved from the MVB Dictionary.
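Once the dictionary exposes data types, `create_schema()` could translate them instead of guessing. A hedged sketch; the dictionary type labels on the left are assumptions, not actual MVB values:

```r
# Hypothetical mapping from Movebank Attribute Dictionary data types to
# Frictionless Table Schema field types. The left-hand labels are assumed.
mvb_to_frictionless_type <- function(mvb_type) {
  switch(mvb_type,
    "decimal"   = "number",
    "integer"   = "integer",
    "text"      = "string",
    "timestamp" = "datetime",
    "string" # fallback when the dictionary type is unknown
  )
}
```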
Comment by @sarahcd:
gps."height-above-ellipsoid" AS minimumDistanceAboveSurfaceInMeters
Some datasets have 'height-above-ellipsoid'. The datum is different but both fit the definition of minimumDistanceAboveSurfaceInMeters.
Is it possible to map conditionally depending on which attribute is present?
Rarely, both attributes are present. And there is also 'height-raw', but this does not define units so should be ignored.
@sarahcd do you have a good example dataset in the Movebank repository that I can use to test if the `write_dwc()` function handles it elegantly?
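The conditional mapping could be sketched as follows (base R; `dplyr::coalesce()` does the same). The preference for height-above-msl when both attributes are present is an assumption, not a documented rule:

```r
# Prefer height-above-msl, fall back to height-above-ellipsoid when missing.
# Both fit the definition of minimumDistanceAboveSurfaceInMeters; the rare
# records with both values keep the preferred one.
height_msl <- c(12.3, NA, 8.1)
height_ellipsoid <- c(NA, 55.2, 60.0)

minimumDistanceAboveSurfaceInMeters <- ifelse(
  is.na(height_msl), height_ellipsoid, height_msl
)
```

'height-raw' is deliberately left out, since it does not define units.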
Has something changed in frictionless?

1: `package` must be a Data Package object.
✖ `package` is missing a "datapackage" class.
ℹ Create a valid Data Package object with `read_package()` or `create_package()`.
Reviewer of data:
65 of the entries in the O_AMELAND dataset have a coordinateUncertaintyInMeters (cUIM) of "999998.9", or 1000 km, indicating that the coordinates for these entries cannot be trusted. If that cUIM is plausible, those 65 records should be discarded.
My answer:
Before records are uploaded to Movebank (from which they are derived for GBIF), we calculate whether or not they constitute an outlier based on speed and angle (https://github.com/inbo/bird-tracking/blob/master/src/outliers.Rmd#L55), not on (potentially imprecise) height and precision values. We exclude outliers when publishing to GBIF, but we prefer not to exclude records based on the precision.
Update http to https in:
Line 146 in afffbde
Once gbif/ipt#1819 is included in the next version of the IPT.
See http://vocab.nerc.ac.uk/collection/MVB/current/MVB000126/:
An integer proportional to the GSM signal strength measured by the tag. Valid values are 0–31, or 99 for unknown or undetectable. Higher values indicate better signal strength. Example: '39'; Units: arbitrary strength units (asu); Entity described: event
The setup suggested in #7 has a grouping event (deployment) and occurrences (1 tag attachment, 0 or more GPS positions). In line with the new model, @timrobertson100 suggests having a separate event for every occurrence (basically eventID = occurrenceID), and grouping all GPS positions under the deployment using parentEventID. The deployment itself also has an occurrence attached: the tag attachment. See table:
occurrenceID | eventID | parentEventID | basisOfRecord | eventRemarks
------------ | --------- | ------------- | ------------- | ------------------
ani1_tag1 | ani1_tag1 | | HumanObs | deployment remarks
occ1 | occ1 | ani1_tag1 | MachineObs |
occ2 | occ2 | ani1_tag1 | MachineObs |
@sarahcd I'm trying this approach to see how it looks at GBIF. Given that the capture and deployment event are one and the same in this model, I would name it "tag deployment" rather than "tag attachment".
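The table above can be generated mechanically; a minimal sketch (column values taken from the table, not produced by movepub):

```r
# Every occurrence is its own event (eventID = occurrenceID); GPS positions
# point to the deployment event via parentEventID.
occ <- data.frame(
  occurrenceID = c("ani1_tag1", "occ1", "occ2"),
  basisOfRecord = c("HumanObservation", "MachineObservation", "MachineObservation")
)
occ$eventID <- occ$occurrenceID
occ$parentEventID <- ifelse(occ$basisOfRecord == "HumanObservation", "", "ani1_tag1")
```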
@sarahcd I noticed that the following (new) terms have labels that are written with a hyphen:

- deploy-off measurements: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000354/
- deploy-off sampling: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000355/
- deploy-on measurements: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000356/
- deploy-on sampling: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000357/
- GSM MCC-MNC: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000125/
- animal birth-hatch latitude: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000338/
- animal birth-hatch longitude: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000339/
- study-specific measurement: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000178/

This in contrast with previous terms:

- deploy off timestamp, etc.: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000077/
- deploy on timestamp, etc.: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000081/
- animal death comments, etc.: http://vocab.nerc.ac.uk/collection/MVB/current/MVB000013/

The extra hyphen makes it annoying to find the term, since we ignore all `-`, `_`, and `.` to look up a term. Is it possible to change the labels to make them consistent?
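The lookup described above (ignoring `-`, `_` and `.`) could be sketched as a small normalizer (hypothetical helper, not the movepub implementation):

```r
# Normalize a term label so "deploy-on measurements" and
# "deploy on measurements" resolve to the same lookup key.
normalize_label <- function(x) {
  tolower(gsub("[-_.\\s]", "", x, perl = TRUE))
}
```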
See gbif/ipt#1778
This is recognized by the IPT editor, but not by other systems:
<intellectualRights>
<para>Public Domain (CC0 1.0)</para>
<rights>Creative Commons Zero v1.0 Universal</rights>
<rightsIdentifier>cc0-1.0</rightsIdentifier>
<rightsIdentifierScheme>SPDX</rightsIdentifierScheme>
<rightsUri>https://creativecommons.org/publicdomain/zero/1.0/legalcode</rightsUri>
<schemeUri>https://spdx.org/licenses/</schemeUri>
</intellectualRights>
Me:
Should the source DOI be an alternative identifier for the dataset in EML?
I think so. I’m not sure if GBIF will then use it though… can you try perhaps?
E.g. animal-reproductive-condition can be mapped, but is not in the source data we have. The function should be able to handle that.

Potential fields:

- ref.animal-reproductive-condition AS reproductiveCondition
- gps.comments AS eventRemarks
- ref.tag-model in eventRemarks

@sarahcd is it possible for the coordinates of the gps records to be NULL or will those be marked as visible=false?
Important point, thanks for catching this! It is possible to have null (empty) coordinate fields that are not marked as outliers. (Sometimes there are other sensor measurements in the record, e.g., battery charge, that the user might want to retain in some contexts, so we don't automatically flag these.) Ideally this would be a preprocessing step, removing these before subsampling records.
Originally posted by @sarahcd in #7 (comment)
@sarahcd what sampling protocol to use:

- tag deployment start?
- gps or gps sensor?
- tag deployment end?

@sarahcd, in 2017 you remarked:
Published datasets in Movebank often include records (rows in the data file) that do not represent a location where an animal was observed:
Pre- or post-deployment records. For these 'individual-local-identifier' is blank. This will only occur rarely; I think none so far include them but we have a dataset in review with data from some undeployed test tags.
I assume all published datasets exclude points outside a deployment? Or should we filter on those in the SQL?
I currently used a hack to circumvent ropensci/EML#342. Would be nice if I didn't have to:
Lines 64 to 65 in ad44aaa
Expected:
<userId directory="http://orcid.org/">0000-0002-8442-8025</userId>
Current output:
<userId>https://orcid.org/0000-0002-8442-8025</userId>
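The workaround essentially splits the full ORCID URL into the two parts EML expects; a sketch with a hypothetical helper:

```r
# Split a full ORCID URL into the directory attribute and the bare
# identifier expected in <userId directory="...">...</userId>.
split_orcid <- function(url) {
  list(
    directory = "http://orcid.org/",
    userId = sub("^https?://orcid\\.org/", "", url)
  )
}
```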
Reviewer of data:
There is also minimal correlation between coordinate precision and cUIM. Four entries with 3 decimal places in
decimalLatitude have cUIM 2.8, 3.5, 4 and 5.9, although the precision in latitude for 3 decimal places is ±55.5 m
(https://en.wikipedia.org/wiki/Decimal_degrees#Precision).
(There are 256 decimalLatitude entries with 15 decimal places, e.g. "53.165282500000004". These have an implied
uncertainty of ±0.055 nanometers, or roughly half an atomic radius, and should have been rounded.)
My answer:
The decimal places of the coordinates are as is, i.e. how these are recorded in the source database and/or recorded by Movebank. Unfortunately that does include false precision (e.g. trailing 000 being dropped or more decimals than reasonable), but we prefer to provide the raw value (as available in the source record), rather than rounding/extending zeros.
@sarahcd @tucotuco what would you suggest? Should we round/extend?
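If rounding were adopted, formatting to a fixed number of decimals (5 decimals is roughly 1 m at the equator) would remove both the truncated and the over-precise values; a hedged sketch, assuming rounding is wanted at all:

```r
# Format coordinates to a fixed number of decimals; 5 decimals corresponds
# to roughly 1 m precision. Whether to round at all is still an open question.
round_coord <- function(x, digits = 5) {
  formatC(round(x, digits), format = "f", digits = digits)
}
```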
In `write_dwc()` there is an assertion:

assertthat::assert_that(
  !is.na(as.integer(study_id)),
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)

This is an error because Movebank study IDs are not necessarily 32-bit integers, e.g. https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study2391441038: 2391441038 is not a 32-bit integer.
Could possibly replace with:

assertthat::assert_that(
  regexpr("^\\d+$", as.character(study_id)) == 1,
  msg = glue::glue("`study_id` ({study_id}) must be an integer.")
)
Indeed, the study URL is pulled from the second alternativeIdentifier in the EML:

Line 96 in d9f1a2d

To allow users to define it, we could add a study_id parameter (from which the link can be built). Given that the link is used in the description and as external link, I think a parameter would be good?

Originally posted by @peterdesmet in #25 (comment)
`write_dwc()` currently returns the found aphia_ids:
library(frictionless)
library(movepub)
p <- read_package("https://zenodo.org/records/10055071/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.10055071
dwc <- write_dwc(p, directory = ".")
#>
#> ── Reading data ──
#>
#> ℹ Taxa found in reference data and their WoRMS AphiaID:
#> Circus cyaneus: 159372
#> Asio flammeus: 212685
#> Circus pygargus: 1037307
#> Circus aeruginosus: 558541
#> Buteo buteo: 558534
#>
#> ── Transforming data to Darwin Core ──
#>
#> ── Writing files ──
#>
#> • 'eml_path'
#> • 'dwc_occurrence_path'
Created on 2023-12-22 with reprex v2.0.2
It would be nice if those aphia_ids were clickable links, which can be done with:
library(cli)
library(movepub)
taxon <- get_aphia_id("Circus cyaneus")
cli::cli_li("{taxon[['name']]}: {.href [{taxon['aphia_id']}]({taxon['aphia_url']})}.")
#> • Circus cyaneus: 159372
#> (<https://www.marinespecies.org/aphia.php?p=taxdetails&id=159372>).
Created on 2023-12-22 with reprex v2.0.2
But it takes some handling of NA and multiple values that I haven't been able to debug.
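The NA and multiple-value handling could be isolated in a small formatter before handing strings to cli. A hypothetical helper; the column names follow the `get_aphia_id()` output shown above:

```r
# Build one cli-ready line per taxon; NA AphiaIDs fall back to plain text.
# Vectorized with ifelse(), so multiple values are handled in one call.
format_aphia_line <- function(name, aphia_id, aphia_url) {
  ifelse(
    is.na(aphia_id),
    paste0(name, ": not found in WoRMS"),
    paste0(name, ": {.href [", aphia_id, "](", aphia_url, ")}")
  )
}
```

Each returned string can then be passed to `cli::cli_li()` as before.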
`write_dwc()` is currently too long and is better split into helper functions. This will also help to call certain functions for certain data (gps vs bird ring etc.):

- dwc_occurrence_ref.R: @noRd, expand_col() in function, but only with the columns that are used in the mapping
- dwc_occurrence_gps.R: @noRd, expand_col() in function, but only with the columns that are used in the mapping

In inbo/bird-tracking#197 a couple of new metadata fields were added:
- alt-project-id (new field): no need to map
- deploy-on-measurements (new field): could be mapped to dynamicProperties
- deployment-comments: now contains reduced content
- study-site (release_location): could be mapped to locality for human obs
- tag-firmware: no need to map

@sarahcd @timrobertson100 I now have an operational function to create Darwin Core from a Movebank dataset: https://inbo.github.io/movepub/reference/write_dwc.html
I have used it to create:
I now want to use this to create the first production dataset, but would like some feedback on:
This would require a call to WoRMS, with the worrms package. I see no function to get the LSID directly, so it would be something like:

aphia_id <- worrms::wm_name2id(name = "Larus fuscus")
scientific_name_id <- paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":") # but keep empty if aphia_id is empty
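To keep the identifier empty when no AphiaID was found, the paste could be wrapped in a guard (hypothetical helper):

```r
# Build an LSID only when an AphiaID is available; return NA otherwise,
# so records without a WoRMS match get no scientificNameID.
build_lsid <- function(aphia_id) {
  ifelse(
    is.na(aphia_id) | aphia_id == "",
    NA_character_,
    paste("urn:lsid:marinespecies.org:taxname", aphia_id, sep = ":")
  )
}
```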
Hi @sarahcd, here's a first attempt at a Movebank dataset on GBIF: https://www.gbif-uat.org/dataset/0ef15f32-b41d-4274-ae96-eb5d0059fee6
Dataset that is included:
[subsampled representation]
See if https://bartk.gitlab.io/move2/reference/movebank_get_vocabulary.html can replace this function.
Could pull from DataCite, e.g. https://api.datacite.org/dois/application/vnd.datacite.datacite+xml/10.5441/001/1.vp4cf4qg:
<rightsList><rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Universal Public Domain Dedication (CC0 1.0)</rights>
Or Movebank REST API, e.g. https://www.movebank.org/movebank/service/direct-read?entity_type=study&study_id=2911040
license_type = CC_0
INBO
Should typically be the institution that collected the data. Not currently represented in DataCite metadata from the repository. In Movebank, I plan to add a study-level attribute that combines institutionID and rightsHolder.
https://doi.org/10.5281/zenodo.5879096
DOI as a URL, makes sense to always use a PURL and refer to the Movebank study elsewhere
MPIAB
institutionCode: = 'custody' of the data. Decided to use 'Max Planck Institute of Animal Behavior'.
Movebank
Given example 'ebird', can use 'Movebank'.
O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium) [subsampled representation]
Using a suffix to avoid systems like ORCID considering the datasets as duplicates
see https://doi.org/10.5281/zenodo.5879096
or see metadata
The helper function `get_aphia_id()` is currently 1) hidden and 2) has no tests.

- get_aphia_id("Mola mola"): should return 1 record
- get_aphia_id(c("Mola mola", "not_a_name")): should return 2 records, one NA
- get_aphia_id(c("Mola mola", "?")): ? is handled differently by wm_name2id_(), but should return NA
- get_aphia_id("Pisces"): unaccepted taxon, should return as normal
- get_aphia_id(c("not_a_name")): should return NA, but currently has an error

In the eml file, the second paragraph of the metadata does not contain html code. So when uploading on the IPT, the links and formatting are not preserved.
The question is: is it worth it to debug, or more time efficient to do this manually?
See https://docs.ropensci.org/EML/articles/creating-EML.html
glue 1.7.0 includes some changes to `glue::glue_sql()` which have introduced a bug in `write_dwc()`:

Error: Expected string vector of length 1
Reviewer of the data:
Please consider recalculating or re-estimating the entries in the coordinateUncertaintyInMeters field. I would have
expected that a default uncertainty could be provided for a GPS tracking unit (or a handheld GPS). Instead, the entries
are hugely varied and given to 0.1 m, which (again) is unbelievable.
My answer:
Values in this field are based on what is recorded as horizontal precision by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off (e.g. the default 30m as lower value for GPS).
The links in vignettes/movepub.Rmd are broken. Sarah provided new links: #49 (comment)
Reviewer of data:
Please consider deleting the minimumDistanceAboveSurfaceInMeters field. In the O_AMELAND dataset the entries
range from -405 (depth) to 6467 (elevation), both of which are unbelievable and make the other entries suspicious.
My answer:
Values in this field are based on what is recorded as height by the GPS tracker. These records can show outliers. We prefer to provide the raw value (as available in the source record), rather than setting a cut-off, which can vary from project to project.
See #12

- dwc:eventID: ref."animal-id" || '_' || ref."tag-id" (different order, not a hyphen)
- tag deployment? use tag attachment
- ref."tag-manufacturer-name" + ref."tag-model"
- "tag attached to "
  CASE
    WHEN ref."manipulation-type" = 'none' THEN 'free-ranging animal'
    WHEN ref."manipulation-type" != 'none' THEN 'manipulated animal (see original dataset)' -- Or exclude these animals from GBIF.
  END
  "by" ref."attachment-type"
- Animalia: yes
- Example `subsampled by hour: first of x records`? ok
- movebank:00000: use identifier as is
- Animalia: yes

Reviewer of data:
The organismName and reproductiveCondition fields are empty and should be deleted.
My answer:
We developed a standardized approach for transforming data from a Movebank study to Darwin Core data (https://inbo.github.io/movepub/reference/write_dwc.html). This approach includes providing fields that might not be available/populated in the source data, such as organismName and reproductiveCondition. We could check and remove empty fields manually for each dataset, but we prefer to use the same (automated) approach for all datasets.
@sarahcd sex is now populated for the GPS records. Should we set lifeStage for those records to adult if the animal was an adult when captured?
Originally posted by @peterdesmet in #7 (comment)
Similar to camtraptor, this package could benefit from snapshot tests for `write_dwc()`. In contrast with camtraptor, `write_dwc()` writes both an eml.xml and a dwc_occurrence.csv file. And there is no example dataset included in the package. You can use this dataset though:
o_assen <-
  frictionless::read_package("https://zenodo.org/record/5653311/files/datapackage.json") %>%
  # Remove the large acceleration resource we won't use (and thus won't download)
  frictionless::remove_resource("acceleration") %>%
  frictionless::write_package("o_assen") # write_package() requires a directory
It can potentially be reduced to fewer animals.
- Subset the reference-data and gps data to a set of 3-4 animal-ids to reduce size

@PietrH would you be willing to tackle this?
@sarahcd so far I have based the mapping for the start and end HumanObservations on fields I have in my datasets. It would be good however to have a mapping based on all fields potentially available in the reference data. Can you check the fields where you agree that they are not useful to map (crossed out) and comment on those you think would be useful to include?
- animal-comments: start occurrenceRemarks?
- animal-death-comments
- animal-exact-date-of-birth
- animal-id: organismID
- animal-latest-date-born
- animal-life-stage: lifeStage
- animal-mass
- animal-nickname: organismName
- animal-reproductive-condition: reproductiveCondition
- animal-ring-id: include, where?
- animal-sex: sex
- animal-taxon: scientificName
- animal-taxon-detail
- attachment-type: included in start eventRemarks
- behavior-according-to
- data-processing-software
- deploy-off-date
- deploy-off-latitude
- deploy-off-longitude
- deploy-off-person
- deploy-on-date: start eventDate
- deploy-on-latitude: start decimalLatitude
- deploy-on-longitude: start decimalLongitude
- deploy-on-person: start recordedBy?
- deployment-comments: included in start eventRemarks
- deployment-end-comments
- deployment-end-type
- deployment-id
- duty-cycle
- geolocator-calibration
- geolocator-light-threshold
- geolocator-sensor-comments
- geolocator-sun-elevation-angle
- habitat-according-to
- location-accuracy-comments
- manipulation-comments
- manipulation-type: included in start eventRemarks
- study-site
- tag-beacon-frequency
- tag-comments
- tag-failure-comments
- tag-id: part of eventID
- tag-manufacturer-name: included in start eventRemarks
- tag-mass
- tag-model: included in start eventRemarks
- tag-processing-type
- tag-production-date
- tag-readout-method
- tag-serial-no