camtrapdp's Introduction

camtrapdp

Camtrapdp is an R package to read and manipulate Camera Trap Data Packages (Camtrap DP). Camtrap DP is a data exchange format for camera trap data. With camtrapdp you can read, filter and transform data (including to Darwin Core) before further analysis in e.g. camtraptor or camtrapR.

Installation

Install the latest released version from CRAN:

install.packages("camtrapdp")

Or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("inbo/camtrapdp")

Usage

With camtrapdp you can read a Camtrap DP dataset into your R environment:

library(camtrapdp)

file <- "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0/example/datapackage.json"
x <- read_camtrapdp(file)
x
#> A Data Package with 4 resources:
#> • deployments
#> • media
#> • observations
#> • individuals
#> Use `unclass()` to print the Data Package as a list.

read_camtrapdp() will automatically convert an older version of Camtrap DP to the latest version. It will also make the data easier to use, by assigning taxonomic information (found in the metadata) to the observations and eventIDs (found in the observations) to the media.

To access the data, use one of the accessor functions like locations():

locations(x)
#> # A tibble: 4 × 5
#>   locationID locationName               latitude longitude coordinateUncertainty
#>   <chr>      <chr>                         <dbl>     <dbl>                 <dbl>
#> 1 e254a13c   B_HS_val 2_processiepark       51.5      4.77                   187
#> 2 2df5259b   B_DL_val 5_beek kleine vi…     51.2      5.66                   187
#> 3 ff1535c0   B_DL_val 3_dikke boom          51.2      5.66                   187
#> 4 ce943ced   B_DM_val 4_'t WAD              50.7      4.01                   187

You can also filter data with one of the filter functions, which automatically filter the related data. For example, here we filter observations on scientific name(s) and return the associated events in that subset:

x %>%
  filter_observations(
    scientificName %in% c("Martes foina", "Mustela putorius")
  ) %>%
  events()
#> # A tibble: 4 × 4
#>   deploymentID eventID  eventStart          eventEnd           
#>   <chr>        <chr>    <dttm>              <dttm>             
#> 1 577b543a     976129e2 2020-06-19 22:31:51 2020-06-19 22:31:56
#> 2 577b543a     b4b39b00 2020-06-23 23:33:53 2020-06-23 23:33:58
#> 3 577b543a     5be4f4ed 2020-06-28 22:01:12 2020-06-28 22:01:18
#> 4 577b543a     a60816f2 2020-06-28 23:33:16 2020-06-28 23:33:22

For more functionality, see the function reference.

camtrapdp's People

Contributors

peterdesmet, pietrh, damianooldoni, sangovae

camtrapdp's Issues

Create `events()`

Suggested in camtraptor July 2023 coding sprint

Two suggested convenience functions, similar to inbo/camtraptor#232, that would be nice to have: locations() and events().

events()

  • Get unique events from observations.
  • Function uses the following fields for distinct: eventID and, if empty, the duration of the event (eventStart + eventEnd)
  • Returns:
    • eventID
    • eventStart
    • eventEnd
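
A minimal base R sketch of the idea above, assuming the observations are available as a data frame with eventID, eventStart and eventEnd columns. The name events_sketch() is hypothetical and this is not the package's implementation; the fallback for empty eventID values is left out.

```r
# Hypothetical helper: derive the unique events from an observations data frame.
events_sketch <- function(observations) {
  cols <- c("eventID", "eventStart", "eventEnd")
  # Keep one row per distinct combination of the event columns
  events <- unique(observations[, cols, drop = FALSE])
  rownames(events) <- NULL
  events
}

# Toy data: two observations share event "e1"
observations <- data.frame(
  observationID = c("o1", "o2", "o3"),
  eventID = c("e1", "e1", "e2"),
  eventStart = as.POSIXct(
    c("2020-06-19 22:31:51", "2020-06-19 22:31:51", "2020-06-23 23:33:53"),
    tz = "UTC"
  ),
  eventEnd = as.POSIXct(
    c("2020-06-19 22:31:56", "2020-06-19 22:31:56", "2020-06-23 23:33:58"),
    tz = "UTC"
  )
)
events <- events_sketch(observations)
```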

Warn or skip when media `accessURI` is local URL

In the example package, some of the media are local. As a result accessURI contains media/20210531082538-RCNX0031.JPG. Maybe we should warn for those, or set serviceExpectation to something other than online.

Create `check_camtrapdp()` function

Checks if an object is indeed a camtrapdp object. Inspiration: https://github.com/frictionlessdata/frictionless-r/blob/main/R/check_package.R

  • Does it have the class c("camtrapdp", "list")?
  • Call the function from observations(), deployments(), media()
  • Is the used version supported? This functionality could be provided by having a version attribute
  • Also call check_package(), see #33
check(x = my_camtrap_dp, version = "1.0")
#> Error: x does not have the expected version 1.0
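
One way the class and version checks could look, as a sketch only. It assumes the object carries a "version" attribute (as proposed above); check_camtrapdp_sketch() is a hypothetical name and the error wording is illustrative.

```r
# Hypothetical check: verifies class and an assumed "version" attribute.
check_camtrapdp_sketch <- function(x, version = "1.0") {
  if (!inherits(x, "camtrapdp")) {
    stop("`x` must be of class \"camtrapdp\".", call. = FALSE)
  }
  x_version <- attr(x, "version")
  if (is.null(x_version) || !identical(x_version, version)) {
    stop("`x` does not have the expected version ", version, ".", call. = FALSE)
  }
  # Return the object invisibly so the check can be used in a pipeline
  invisible(x)
}

# Toy object with the proposed class and attribute
x <- structure(list(), class = c("camtrapdp", "list"), version = "1.0")
check_camtrapdp_sketch(x)
```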

Create `camtrapdp.print()` function

Suggested in camtraptor July 2023 coding sprint

The package object is a long list (frictionless object) that isn't intuitive when printed to the console. A summary() function would be really helpful to provide the most important aspects of a package:

  • Function should be called automatically when package is returned to the user (cf. bioRad)
  • For now, the function should return the following info (this can be extended when needed):
    • deployments: 4
    • media: 140
    • observations: 28
  • The summary info could be calculated on the fly (calculate_summary()) or we store that info in package$summary.
  • The function should be used as the source for package parameter description, which other functions can call via inheritParams
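
A sketch of how such a summary could be printed automatically, using an S3 print method. It assumes the object has class "camtrapdp" and stores its data frames in x$data; both are assumptions taken from the discussion in these issues, not confirmed behaviour.

```r
# Sketch of an S3 print method: R calls it automatically when a
# "camtrapdp" object is printed at the console.
print.camtrapdp <- function(x, ...) {
  # Count the rows of every attached data frame on the fly
  counts <- vapply(x$data, nrow, integer(1))
  writeLines(paste0(names(counts), ": ", counts))
  invisible(x)
}

# Toy object mimicking the counts listed above
x <- structure(
  list(
    data = list(
      deployments = data.frame(id = seq_len(4)),
      media = data.frame(id = seq_len(140)),
      observations = data.frame(id = seq_len(28))
    )
  ),
  class = c("camtrapdp", "list")
)
printed <- capture.output(print(x))
```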

Camtrap DP object

I think the camtrapdp package should return a dedicated camtrapdp object, rather than package (see #18). It is very similar to package as returned by frictionless::read_package() but also contains dataframes in object$data and maybe some custom attributes.

Name

The name of the object could be camtrapdp, camtrap_dp, dataset or package. The same name is ideally used for a custom class (cf. tbl_df), parameter names and documentation:

camtrapdp

  • Object: camtrapdp
  • Class: c("camtrapdp", "list")
  • Parameter: deployments(camtrapdp)
  • Documentation: Derived from camtrapdp$taxonomy
  • Remark: Short, same as package name (could be good or bad).

camtrap_dp

  • Object: camtrap_dp
  • Class: c("camtrap_dp", "list") (cf. tbl_df)
  • Parameter: deployments(camtrap_dp)
  • Documentation: Derived from camtrap_dp$taxonomy
  • Remark: Follows snake naming convention.

dataset

  • Object: dataset
  • Class: c("camtrap_dp", "list") Would not use dataset here.
  • Parameter: deployments(dataset)
  • Documentation: Derived from dataset$taxonomy
  • Remark: Too generic?

package

  • Object: package, even though it is a package with attached data, so more than a frictionless::package object
  • Class: c("camtrap_dp", "list") Would not use package here.
  • Parameter: deployments(package)
  • Documentation: Derived from package$taxonomy
  • Remark: Can be confused with R pkg. Indicates that it's (still) a frictionless::package, but that might not be necessary to know for most users.

camtrap_dp looks like the most logical one, but overusing the Camtrap DP/camtrapdp/camtrap-dp/camtrap_dp name can be confusing too.

Attributes

I'm tempted to add custom attributes to the class. Even though the Camtrap DP version can be derived from object$profile, I think it would be good to have it in a dedicated attribute:

attr(dataset, "version")
#> [1] "1.0"

@PietrH @damianooldoni @sangovae thoughts?

`type` for camera trap images

How to populate dwc:type for a camera trap observation?

I thought Event, as that is basically what the observation is, but @tucotuco argues for StillImage since the image is the source of the observation. Some observations are based on video files (MovingImage), others on a series of still images (a sequence, so still StillImage).

I don't think we should make the distinction between videos, sequences or photos and would suggest to use the fixed (higher) term Image for type. @tucotuco, does that make sense?

Add `convert_to` argument to `read_camtrap_dp()`

Scenario:

  1. A new version of Camtrap DP comes out (1.1)
  2. camtrapdp::read_camtrap_dp() is updated to support this latest version and upconverts 1.0 to 1.1
  3. camtraptor, which uses camtrapdp as a dependency now breaks, because read_camtrap_dp() no longer returns the version (1.0) camtraptor expects internally

@PietrH @damianooldoni I assume camtraptor can control this by defining camtrapdp == <working version> in its DESCRIPTION? Would that be annoying for users installing the package?

Alternatively, read_camtrap_dp() could get a convert_to parameter, e.g. read_camtrap_dp(file, convert_to = "1.0"), so that other packages can control to which version read_camtrap_dp() is allowed to convert?

Create `validate()` and helper functions to validate integrity of a Camtrap DP

Suggested in camtraptor July 2023 coding sprint

An important aspect before analysing or publishing data is to check whether the dataset does not contain any major integrity errors, such as missing dates, coordinates, values not meeting controlled vocabularies or relationships between tables not being correct. Although validation is possible with the Python software Frictionless Framework, for most users the returned error messages are hard to parse.

  • camtraptor (or the frictionless R package) could offer some basic data validation (easier to implement than the entire metadata and data validation Frictionless Framework offers).
  • Users can correct issues by retrieving data (#232), correcting errors and updating the package (#248).
  • A user facing validate() function could make use of a number of check_ helper functions. Those helper functions could also be run by other functions, e.g. when updating data (#248).

Suggestions for functions:

  • validate(package)
  • check_relations(package): relationships are valid
  • check_identifiers(package, "table name"): IDs are unique
  • check_required(package, "table name"): required fields are populated
  • check_vocabularies(package, "table name"): values meet factor levels. Note that read_resource()/readr() converts these to factors and might throw problems()
  • check_data_types(package, "table name"): note that read_resource()/readr() will throw problems() but otherwise will do a best attempt at converting
  • check_timestamps(package, "table name"): has timezone, start <= end (specific to camtraptor, not a frictionless thing)
  • check_durations(package): obs & media timestamps within deployment (specific to camtraptor, not a frictionless thing)
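
To make two of these helpers concrete, here is a base R sketch that works on already-read data frames. The function names, arguments and return values are hypothetical simplifications of check_identifiers() and check_relations() as suggested above: they take data frames directly rather than a package and table name.

```r
# Hypothetical helper: returns identifier values that occur more than once.
check_identifiers_sketch <- function(df, id_col) {
  ids <- df[[id_col]]
  unique(ids[duplicated(ids)])
}

# Hypothetical helper: returns key values in the child table that have
# no match in the parent table (broken relationships).
check_relations_sketch <- function(child, parent, key) {
  setdiff(unique(child[[key]]), parent[[key]])
}

# Toy data with one duplicate ID and one orphaned media record
deployments <- data.frame(deploymentID = c("d1", "d2", "d2"))
media <- data.frame(
  mediaID = c("m1", "m2", "m3"),
  deploymentID = c("d1", "d2", "d9")
)
duplicate_ids <- check_identifiers_sketch(deployments, "deploymentID")
orphans <- check_relations_sketch(media, deployments, "deploymentID")
```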

While it would be useful if these were functions of the frictionless R package, it might not be what we expect for camtraptor. Frictionless would have its validation run on resources (i.e. csv files + schemas), since returned data frames lose the connection with their schema, so it is not possible to validate for relationships or unique, as that information is lost. Camtraptor on the other hand, wants to validate the (already read) data frames.

Transform license `name` to a URL

write_dwc() currently only considers the license name (since that is the property that will be set by the IPT) and doesn't recode this to any value like a URL. Maybe it should?

set minimum R version explicitly

If we want to support older R versions such as R 4.0, we should avoid features introduced in R 4.1, like the native pipe |> and the shorthand for anonymous functions \(arg).

What R version do we still want to support?

  • R3.5.0 (April 2018) is the current minimum for camtraptor
  • R3.6.3 is the latest version (Feb 2020) before R4.0.0

There is probably a usethis function to set it, but I can't find it atm.

Attach data to `resources[1]$data`

Quick note to self: for higher compatibility with frictionless-r, it would be better to store the dataframes in package$resources[x]$data, rather than package$data. I know we just moved it to data, but the suggestion here would mean that write_package() would work out of the box. Food for thought.

Release camtrapdp 0.2.0

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • git push

Submit to CRAN:

  • usethis::use_version('minor')
  • Update NEWS.md
  • devtools::check_win_devel()
  • codemetar::write_codemeta()
  • Merge #65
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Finish & publish blog post
  • Add link to blog post in pkgdown news menu
  • Add badges to README:
[![CRAN status](https://www.r-pkg.org/badges/version/camtrapdp)](https://CRAN.R-project.org/package=camtrapdp)
[![CRAN checks](https://badges.cranchecks.info/worst/camtrapdp.svg)](https://cran.r-project.org/web/checks/check_results_camtrapdp.html)
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Tweet

Will `write_eml()` be a camtrapdp function?

Hi! While working on camtraptor version 1.0, I am going to remove write_dwc() (and associated file write_dwc.R) in camtraptor as this functionality has been transferred to camtrapdp, see #55.

Questions:

  1. Can I remove function from camtraptor as well?
  2. Will write_eml() be implemented in camtrapdp as well?

Create `merge()` to merge datasets

Note by @peterdesmet: see #75 (comment) for recent thoughts.


Quite often we want to merge the exports of several research projects (more and more so for multi-site research), all using Camtrap DP as export format and mostly coming from Agouti.
Although this can be done manually by calling read_camtrap_dp several times and then merging the resulting csv files, I was wondering whether it would be better to add a function to camtraptor that automatically

  • merges different data packages resulting from read_camtrap_dp
  • adds a column to the deployments.csv using the project_name as variable
  • adds a column to the deployments.csv using the project location / study site
  • the function name could be "join_exports" or something similar but this I leave up to Peter :-)
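
A rough sketch of the first two bullets, limited to the deployments table. It assumes each package is a list with data frames in $data and a project title in $project$title; merge_sketch() and those paths are illustrative assumptions, and combining metadata is not addressed.

```r
# Hypothetical helper: stack the deployments of two packages and record
# the originating project in a new column.
merge_sketch <- function(x, y) {
  tag <- function(package) {
    deployments <- package$data$deployments
    # Assumed location of the project name in the metadata
    deployments$projectName <- package$project$title
    deployments
  }
  list(data = list(deployments = rbind(tag(x), tag(y))))
}

# Toy packages
x <- list(
  project = list(title = "Project A"),
  data = list(deployments = data.frame(deploymentID = c("a1", "a2")))
)
y <- list(
  project = list(title = "Project B"),
  data = list(deployments = data.frame(deploymentID = "b1"))
)
merged <- merge_sketch(x, y)
```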

Update `read_camtrap_dp()` to support Camtrap DP v1.0 onwards with `convert_` functions

Suggested in camtraptor July 2023 coding sprint

See inbo/camtraptor#251 for background. We plan to support all legacy versions of Camtrap DP from version 1.0 onwards.

This will be done by read_camtrap_dp(), which can detect the used version and run the profile and data through a number of convert_ functions to get it to the latest version, e.g.:

# package == v1.0 and since then there are versions 1.1, 1.2 and 1.3
package <- convert_1.0_to_1.1(package)
package <- convert_1.1_to_1.2(package)
package <- convert_1.2_to_1.3(package)

Working sequentially saves us from having to repeat the same conversions over and over again. It also makes it easier to notice what has changed between versions.

If a certain step is more cumbersome to do via an intermediate version, it is possible to create a conversion function that skips a number of versions, e.g. convert_1.0_to_1.2(). The existing number of convert functions would then be:

# convert_1.0_to_1.1() deleted
convert_1.0_to_1.2()
convert_1.2_to_1.3()

These specific conversion functions could all be handled by a generic function:

convert(package)

This function would:

  • Get the dataframes and attach these to $data
  • Read the version
  • Call the required number of convert_ functions on the data frames and profile (metadata)
  • Return the package
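
A sketch of how a generic convert() could chain the version-specific steps, with a convert_to argument to cap the target version. It assumes the version is stored in a "version" attribute; the step functions here only bump that attribute and stand in for real conversions, and convert_sketch() is a hypothetical name.

```r
# Hypothetical generic: applies one conversion step at a time until the
# requested version is reached.
convert_sketch <- function(package, convert_to = "1.2") {
  # One placeholder step per source version; real steps would also
  # transform the data frames and the profile (metadata)
  steps <- list(
    "1.0" = function(p) { attr(p, "version") <- "1.1"; p },
    "1.1" = function(p) { attr(p, "version") <- "1.2"; p }
  )
  version <- attr(package, "version")
  while (!identical(version, convert_to)) {
    step <- steps[[version]]
    if (is.null(step)) {
      stop("No conversion available from version ", version, ".", call. = FALSE)
    }
    package <- step(package)
    version <- attr(package, "version")
  }
  package
}

package <- structure(list(), version = "1.0")
latest <- convert_sketch(package)
pinned <- convert_sketch(package, convert_to = "1.1")
```

Skipping a version (e.g. going straight from 1.0 to 1.2) would then only require replacing the "1.0" entry in steps.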

Create `locations()`

Suggested in camtraptor July 2023 coding sprint

Two suggested convenience functions, similar to inbo/camtraptor#232, that would be nice to have: locations() and events().

locations()

  • Get unique locations from deployments.
  • Function uses the following field for distinct: locationID, if empty locationName, if empty coordinates
  • Returns:
    • locationID
    • locationName
    • latitude
    • longitude
    • coordinateUncertainty
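
A simplified base R sketch, assuming the deployments are a data frame containing the five columns listed above. It takes distinct rows over all five columns and does not implement the locationID / locationName / coordinates fallback; locations_sketch() is a hypothetical name.

```r
# Hypothetical helper: unique locations from a deployments data frame.
locations_sketch <- function(deployments) {
  cols <- c(
    "locationID", "locationName", "latitude", "longitude",
    "coordinateUncertainty"
  )
  locations <- unique(deployments[, cols, drop = FALSE])
  rownames(locations) <- NULL
  locations
}

# Toy data: two deployments at the same location
deployments <- data.frame(
  deploymentID = c("d1", "d2", "d3"),
  locationID = c("l1", "l1", "l2"),
  locationName = c("park", "park", "forest"),
  latitude = c(51.5, 51.5, 51.2),
  longitude = c(4.77, 4.77, 5.66),
  coordinateUncertainty = c(187, 187, 187)
)
locations <- locations_sketch(deployments)
```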

create helper `build_taxonomy()`

We want to connect taxonomic information to the observations.csv data. To do this, we'd like to use a helper that parses the taxonomic information from datapackage.json

  • Output should be a tibble with one row per species in datapackage.json
  • Output should have a column for every available field in x$taxonomic, even if only used in one species. So if one species has a Polish name, the Polish name should be a column that is NA for every record but one.
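
A sketch of the "one row per species, one column per field" idea in base R. It assumes the taxonomic metadata is a list of flat named lists; the field name polishName is made up for illustration, nested fields are not handled, and a plain data frame is returned rather than a tibble.

```r
# Hypothetical helper: turn a list of taxa into a data frame, filling
# fields that a taxon lacks with NA.
build_taxonomy_sketch <- function(taxonomic) {
  # Union of all field names used by any taxon
  fields <- unique(unlist(lapply(taxonomic, names)))
  rows <- lapply(taxonomic, function(taxon) {
    values <- lapply(fields, function(field) {
      if (is.null(taxon[[field]])) NA_character_ else as.character(taxon[[field]])
    })
    names(values) <- fields
    as.data.frame(values, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy metadata: only the first taxon has the (illustrative) polishName field
taxonomic <- list(
  list(scientificName = "Martes foina", taxonRank = "species", polishName = "kuna domowa"),
  list(scientificName = "Mustela putorius", taxonRank = "species")
)
taxonomy <- build_taxonomy_sketch(taxonomic)
```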

Update `write_dwc()` to Camtrap DP 1.0

Also describe the mapping and provide an example, cf. https://inbo.github.io/movepub/reference/write_dwc.html#data

  • Update dwc_audubon.csv to dwc_audiovisual.csv. Also update in eml.xml, functions, tests and documentation.
  • identificationRemarks: remove mapping for featureType none and other
  • license is currently only set based on path, but would be better derived from name (see tdwg/camtrap-dp#344)
  • type: add MovingImage (see #7)
  • rightsHolder: rightsHolder from contributors (select first)
  • collectionCode: from sources (select first)
  • eventID: eventID
  • eventDate: date range (see #13)
  • serviceExpectation: enable (see #10)
  • sex: map values
  • lifeStage: map values
  • eventRemarks: use boolean baitUse
  • eventRemarks: add deploymentGroups? wontfix

Add assignment equivalent functions `deployments() <- value` , `media() <- value` and `observations() <- value`

Similar to purrr::pluck() <- : https://purrr.tidyverse.org/reference/pluck.html we would like to be able to create or modify objects nested within the datapackage structure.

Using the purrr syntax as an example:

https://github.com/tidyverse/purrr/blob/870696c7d9f3208298ea84a36d813ffd28e59e49/R/pluck.R#L103-L108

We probably want to make use of purrr::assign_in()

TODO

  • deployments(x) <-
  • media(x) <-
  • observations(x) <-
  • Update any x$data$name_of_df <- assignments elsewhere in the code (e.g. here).

Scope

  • The assignment functions should only assign data.frames/tibbles
  • Do they need to be exported/public facing?
  • These assignment functions will provide in place assignment, and will modify the datapackage object they have as an argument
  • Similar to purrr::pluck() I suggest they return the modified object as well
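
In base R this can be done with a replacement function, i.e. a function whose name ends in <-. The sketch below assumes the data frames live in x$data; it is an illustration, not the package's code. Note that R rebinds the modified copy to x, so the assignment looks in-place to the caller even though the function itself returns a new object.

```r
# Sketch of a replacement function: enables `deployments(x) <- value`.
`deployments<-` <- function(x, value) {
  # Only data frames (including tibbles) are accepted
  if (!is.data.frame(value)) {
    stop("`value` must be a data frame.", call. = FALSE)
  }
  x$data$deployments <- value
  # Replacement functions must return the modified object
  x
}

x <- list(data = list())
deployments(x) <- data.frame(deploymentID = c("d1", "d2"))
```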

Faster `read_camtrap_dp()`

In the new Camtrap DP format, observations no longer have a deploymentID. That means that any join between observations and deployments has to be done over media, which can be costly. @PietrH and I discussed how this could be implemented in an efficient way in camtraptor.

read_camtrap_dp()

  1. Reads a datapackage.json and stores as a "package" list object (same as now)
  2. Reads resources from disk with frictionless::read_resource() (same as now)
  3. Creates a joined dataframe: deployments > media > observations + distinct by eventID. This could be done with dplyr or data.table (new)

Creating a single df is a bad idea, see inbo/camtraptor#202 (comment). Rather, keep tables separate, but include tables for sequences, locations, taxonomy (cf https://wec.wur.nl/r/ctdp/)

  4. Attaches the joined dataframe to the package, e.g. at $data (new)
  5. No longer stores $data$deployments, $data$media, $data$observations to save memory (new)
  6. Returns the package (same as now)

filter()

  1. Masks dplyr::filter() for a camtrap dp object (#92)
  2. Uses dplyr::filter() (or data.table) on $data under the hood
  3. Can optionally also filter the spatial, temporal and taxonomic scope in the metadata

other functions

  1. Typically make use of $data rather than $data$deployments, ...
  2. Can still access media or other resources if they need to (by reading from disk)

Filter `spatial` and `temporal` in metadata with `filter_deployments()`

The spatial and temporal properties in the metadata are derived from the deployments. filter_deployments() affects these ranges and should therefore update those (potentially with a flag, update_metadata = TRUE). At the moment, this is less important because the metadata are not used within the R environment, but it will become important once packages are written back to disk.
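
As an illustration for the temporal part, a sketch that recomputes the range from the remaining deployments. It assumes deployments have deploymentStart and deploymentEnd datetime columns and that the metadata stores the range in package$temporal$start and package$temporal$end; update_temporal_sketch() is a hypothetical name.

```r
# Hypothetical helper: derive the temporal coverage from the deployments.
update_temporal_sketch <- function(package) {
  deployments <- package$data$deployments
  package$temporal <- list(
    start = format(min(deployments$deploymentStart), "%Y-%m-%d", tz = "UTC"),
    end = format(max(deployments$deploymentEnd), "%Y-%m-%d", tz = "UTC")
  )
  package
}

# Toy package with two deployments
package <- list(
  data = list(
    deployments = data.frame(
      deploymentID = c("d1", "d2"),
      deploymentStart = as.POSIXct(
        c("2020-05-30 02:57:37", "2020-06-19 21:00:00"), tz = "UTC"
      ),
      deploymentEnd = as.POSIXct(
        c("2020-07-01 09:41:41", "2020-06-28 23:33:22"), tz = "UTC"
      )
    )
  )
)
package <- update_temporal_sketch(package)
```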

Create `write_camtrapdp()` function

Suggested in camtraptor July 2023 coding sprint

It is a fair expectation that camtraptor also supports writing Camtrap DP, e.g. to share cleaned datasets or publish them on the IPT.

Use cases

  • Reading Camtrap DP, cleaning and writing: inbo/camtraptor#248
  • Reading Camtrap DP, round_coordinates() and writing
  • Reading Camtrap DP, filtering and writing: #23
  • Reading Camtrap DP, merging and writing: #75
  • Reading Wildlife Insights export and writing

All these start from a package object. It will therefore be harder to create a Camtrap DP entirely from scratch, but it could be possible with:

p <- frictionless::create_package()
# user adds the necessary metadata as elements to the list
deployments(p) <- df
media(p) <- df
observations(p) <- df
write_camtrap_dp(p, "directory")

Functionality

  • Make best attempt at creating a valid Camtrap DP. E.g. by running validate() #58 (and refusing to write if there are errors?)
  • read_camtrap_dp() %>% write_camtrap_dp() should create the exact same package (except for created date)
  • All existing metadata elements should be retained
  • If required metadata elements are not present, user should be warned (especially important if package was created from scratch)
  • taxonomy, spatial and temporal should be recreated based on the (filtered) data
  • A package that is the result of a merge() #75 should combine its metadata (not sure how)

Overview of functions

flowchart
    file["datapackage.json"] -->|file| f_read("read_camtrapdp()")
    f_read --> o(("camtrapdp"))
    o <--> f_convert("convert()")
    o --> f0("check_camtrapdp()")
    o --> f1("deployments()")
    o --> f2("media()")
    o --> f3("observations()")
    o --> f4("taxa()")
    o -->|by = locationID| f5("locations()")
    o -->|by = eventID| f6("events()")

Create `filter_deployments()`, `filter_observations()`, `filter_media()`

Suggested in camtraptor July 2023 sprint

  • Create 3 filter functions for the 3 tables.

  • Functions purposely use the same syntax as dplyr::filter()

    • First argument is package (not a data frame)
    • Following arguments are filter conditions, e.g. colname == value, colname >= integer, !is.na(colname), etc.
    • Function suggests column names (cf. dplyr filter)
    • Function should probably be developed as an extension to dplyr filter, see inbo/camtraptor#92 (comment)
  • Functions replace the species, sex, life_stage parameters in some functions like map_dep()

  • Functions replace the predicates functionality, so all pred_ functions can be removed:

    • apply_filter_predicate()
    • pred() -> filter(package, colname == "value")
    • pred_not() -> filter(package, !colname)
    • pred_gt() -> filter(package, colname > 1)
    • pred_gte() -> filter(package, colname >= 1)
    • pred_lt() -> filter(package, colname < 1)
    • pred_lte() -> filter(package, colname <= 1)
    • pred_in() -> filter(package, colname %in% c(1,2,3))
    • pred_notin() -> filter(package, !colname %in% c(1,2,3))
    • pred_na() -> filter(package, is.na(colname))
    • pred_notna() -> filter(package, !is.na(colname))
    • pred_and() -> filter(package, colname1 == "value1", colname2 == "value2") or &
    • pred_or() -> filter(package, colname1 == "value1" | colname2 == "value2")

Usage

package %>% filter_deployments(startDeployment >= "2022-09-05")

filter_deployments()

  • filters deployments
  • filters observations (on deploymentID)
  • filters media (on deploymentID)

filter_observations()

  • filters observations
  • does not filter deployments, so absences are retained
  • does filter media (an earlier proposal was not to filter media, since a media file can be used by more than one observation and observations are not always hard linked to media)

filter_media()

  • filters media
  • does not filter deployments
  • filters observations (on mediaID), but note that not all observations have these
  • could filter observations that fall within same deployment and same time range as filtered media

Tests are slow because of `example_dataset()`, should we cache it?

Tests take 47 seconds on my system, almost all because example_dataset() is downloaded time and time again.

I see some solutions:

A. We could only use it once in the tests, and reuse the object example_dataset <- example_dataset() , but that might cause trouble if we accidentally make changes to the example_dataset object.

B. We could use some sort of caching:

C. Or we could wait 47 seconds...

ref: https://en.wikipedia.org/wiki/Memoization
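
A minimal sketch of option B using a closure, so the dataset is loaded once per session and reused afterwards. make_cached() and slow_loader() are hypothetical stand-ins; slow_loader() takes the place of the actual download.

```r
# Hypothetical helper: wrap a loader so its result is computed only once.
make_cached <- function(loader) {
  cache <- NULL
  function() {
    # Only call the loader the first time; reuse the stored result after that
    if (is.null(cache)) {
      cache <<- loader()
    }
    cache
  }
}

# Stand-in for the slow download, counting how often it is called
calls <- 0
slow_loader <- function() {
  calls <<- calls + 1
  list(name = "example dataset")
}

cached_example <- make_cached(slow_loader)
first <- cached_example()
second <- cached_example()
```

Since R copies objects on modification, a test that changes the returned object does not alter the cached copy, which addresses the concern raised under option A.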

Add warning for `locations()`

locations() must return a warning if:

  • duplicates of locationID are present.
  • duplicates of locationName are present and locationID is NA.
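
A sketch of both checks in base R, assuming a locations data frame with locationID and locationName columns. The function name and warning texts are illustrative.

```r
# Hypothetical helper: warn about ambiguous locations.
warn_duplicate_locations <- function(locations) {
  ids <- locations$locationID
  # duplicated() treats NA values as equal, so exclude them explicitly
  dup_ids <- unique(ids[!is.na(ids) & duplicated(ids)])
  if (length(dup_ids) > 0) {
    warning(
      "Duplicate locationID: ", paste(dup_ids, collapse = ", "),
      call. = FALSE
    )
  }
  # Only look at location names for records without a locationID
  names_no_id <- locations$locationName[is.na(ids)]
  dup_names <- unique(names_no_id[!is.na(names_no_id) & duplicated(names_no_id)])
  if (length(dup_names) > 0) {
    warning(
      "Duplicate locationName without locationID: ",
      paste(dup_names, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(locations)
}

# Toy data: "l1" occurs twice
locations <- data.frame(
  locationID = c("l1", "l1", "l2"),
  locationName = c("park", "park east", "forest")
)
message_text <- tryCatch(
  warn_duplicate_locations(locations),
  warning = function(w) conditionMessage(w)
)
```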
