
Comments (4)

peterdesmet commented on August 17, 2024

by @amyjsdavis: I originally envisioned the modeling framework building on code already developed by you and others on the TrIAS team, but as an entirely separate process that begins with the data download, since I require global data and more stringent filtering than is needed to develop the indicators. However, I understand your and Peter's wish not to overwhelm the GBIF platform and to develop a data cleaning pipeline. I see this as the only potential area of overlap between our work packages.

If you want, we can get together to discuss this early next year, or just work on it remotely.

To give you an idea of my requirements, here is the R code I am currently using to filter my data.

#Clean global occurrence data (occ)

library(rgbif)
library(tidyverse)

occ_clean <- occ %>%
  filter(basisOfRecord != "FOSSIL_SPECIMEN") %>% # drop fossil records
  filter(hasCoordinate == "TRUE") %>% # keep only records with coordinates
  filter(hasGeospatialIssues == "FALSE") %>% # drop records flagged by GBIF
  filter(is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 708) %>%
  select(taxonKey, species, scientificName, decimalLatitude, decimalLongitude,
         eventDate, year, coordinateUncertaintyInMeters, datasetKey,
         countryCode, establishmentMeans) %>% # select desirable variables
  # filter points with unacceptable accuracy (at most one decimal place);
  # the optional "-" makes the pattern apply to negative coordinates as well
  filter(!grepl("^-?[0-9]+(\\.[0-9]{0,1})?$", decimalLatitude)) %>%
  filter(!grepl("^-?[0-9]+(\\.[0-9]{0,1})?$", decimalLongitude))
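
To make it easier to see which values the accuracy pattern in the filter above removes, here is a small check in base R. It is only a sketch: the sample coordinates are made up, and the pattern used here is a variant that also accepts a leading minus sign, which is an assumption about the intended behaviour for southern and western coordinates.

```r
# Variant of the accuracy pattern with an optional leading minus sign
# (assumption: negative coordinates should be screened the same way).
low_precision <- "^-?[0-9]+(\\.[0-9]{0,1})?$"

# Made-up sample values, kept as text so that the decimals are preserved.
coords <- c("51", "51.2", "51.23", "-12.5", "-12.4567", "4.")

# TRUE means the value has at most one decimal place and would be dropped.
is_low <- grepl(low_precision, coords)
print(data.frame(coords, is_low))
```

One caveat worth keeping in mind: if the coordinate columns are read as numeric, R drops trailing zeros when converting them to text, so a value recorded as 51.20 would be tested as "51.2" and treated as low precision.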

from indicators.

amyjsdavis commented on August 17, 2024

Let's meet on the 15th of January at 13:00. Room details to follow.


damianooldoni commented on August 17, 2024
  1. Data cleaning filters used for both the indicators (IND) and the risk assessment (RA) can be passed directly to GBIF to trigger a (smaller) download.
  2. Filters that are not shared by both would be implemented in two separate data cleaning pipelines in a second step.

Query

https://www.gbif.org/occurrence/search?basis_of_record=OBSERVATION&basis_of_record=HUMAN_OBSERVATION&basis_of_record=MATERIAL_SAMPLE&basis_of_record=LITERATURE&basis_of_record=PRESERVED_SPECIMEN&basis_of_record=UNKNOWN&basis_of_record=MACHINE_OBSERVATION&year=1000,2019&has_coordinate=true&has_geospatial_issue=false
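
Since the query string is long and repeats basis_of_record several times, it may be easier to assemble it from a list of filters. The following is a minimal sketch in base R that mirrors the parameters of the query as written above; build_query is a hypothetical helper name, not part of rgbif, and values are pasted in without URL encoding.

```r
# Base URL of the GBIF occurrence search page, as used in the query above.
base_url <- "https://www.gbif.org/occurrence/search"

# Shared filters; a repeated parameter (basis_of_record) is stored as a vector.
filters <- list(
  basis_of_record = c("OBSERVATION", "HUMAN_OBSERVATION", "MATERIAL_SAMPLE",
                      "LITERATURE", "PRESERVED_SPECIMEN", "UNKNOWN",
                      "MACHINE_OBSERVATION"),
  year = "1000,2019",
  has_coordinate = "true",
  has_geospatial_issue = "false"
)

# Hypothetical helper: expand every value into "name=value" and join with "&".
build_query <- function(base, params) {
  pairs <- unlist(lapply(names(params), function(nm) {
    paste0(nm, "=", params[[nm]])
  }))
  paste0(base, "?", paste(pairs, collapse = "&"))
}

query_url <- build_query(base_url, filters)
cat(query_url, "\n")
```

Adding a pipeline-specific filter (for example a country or taxon_key entry) would then only require one more element in the list.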

Filters

Taxa

taxon_key= ...
  1. IND: all species of the unified checklist
  2. RA: list of 50+ species to model

Country

country = ...

Populated by GBIF if not provided (Country derived from coordinates). The field countryCode is not a common field.

  1. IND: country=BE
  2. RA: global (no filter)

basisOfRecord

basis_of_record=OBSERVATION&
basis_of_record=HUMAN_OBSERVATION&
basis_of_record=MATERIAL_SAMPLE&
basis_of_record=LITERATURE&
basis_of_record=PRESERVED_SPECIMEN&
basis_of_record=UNKNOWN&
basis_of_record=MACHINE_OBSERVATION

Everything except FOSSIL_SPECIMEN (represents historical distribution) and LIVING_SPECIMEN (includes locations of captive taxa). See https://rs.gbif.org/vocabulary/dwc/basis_of_record.xml. MACHINE_OBSERVATION will overestimate the presence of some species in particular areas, but since the data will be aggregated per grid (yes/no per year), that shouldn't be a problem, and we wouldn't want to leave out machine observations with a lower sampling rate (e.g. camera traps).

presence

no filter exists

We will need to filter out absence records. There is no download filter for this, so we will need to filter those out later.

occurrenceStatus != absent or excluded
individualCount != 0
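
The two conditions above could be applied after the download roughly as follows. This is a sketch in base R on made-up data, not the pipeline's actual code: remove_absences is a hypothetical helper, status values are compared case-insensitively, and records with a missing occurrenceStatus or individualCount are assumed to be presences and are kept.

```r
# Toy data standing in for a downloaded occurrence table (values are made up).
occ <- data.frame(
  gbifID = 1:5,
  occurrenceStatus = c("present", "absent", NA, "Excluded", "present"),
  individualCount = c(3, 1, NA, 2, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a record only if it is explicitly an absence.
remove_absences <- function(df) {
  status <- tolower(df$occurrenceStatus)
  # Absence according to occurrenceStatus (missing status counts as presence).
  is_absent_status <- !is.na(status) & status %in% c("absent", "excluded")
  # Absence according to individualCount (missing count counts as presence).
  is_zero_count <- !is.na(df$individualCount) & df$individualCount == 0
  df[!(is_absent_status | is_zero_count), , drop = FALSE]
}

occ_present <- remove_absences(occ)
print(occ_present)
```

Handling the missing values explicitly matters here, because a plain comparison such as individualCount != 0 returns NA for records without a count.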

year

year = 1000,2019 # There is no year present filter

We work at year level, so this field is the only one which is really important. Year is populated from the data by GBIF.

coordinates

has_coordinate=true

We need coordinate information (decimalLatitude and decimalLongitude) for occupancy (IND) and comparing with climate data (RA). Note that we will use coordinateUncertaintyInMeters information as an indicator of quality in the aggregated data (#44), but not as a filter.

coordinate issues

# won't filter in download

There is a has_geospatial_issue flag that is added by GBIF and covers the coordinate issues that are indicated with **:

**ZERO_COORDINATE**
**COORDINATE_OUT_OF_RANGE**
**COORDINATE_INVALID**
COORDINATE_ROUNDED
GEODETIC_DATUM_INVALID
GEODETIC_DATUM_ASSUMED_WGS84
COORDINATE_REPROJECTED
COORDINATE_REPROJECTION_FAILED
COORDINATE_REPROJECTION_SUSPICIOUS
COORDINATE_ACCURACY_INVALID
COORDINATE_PRECISION_INVALID
COORDINATE_UNCERTAINTY_METERS_INVALID
COORDINATE_PRECISION_UNCERTAINTY_MISMATCH
**COUNTRY_COORDINATE_MISMATCH**
COUNTRY_MISMATCH
COUNTRY_INVALID
COUNTRY_DERIVED_FROM_COORDINATES
CONTINENT_COUNTRY_MISMATCH
CONTINENT_INVALID
CONTINENT_DERIVED_FROM_COORDINATES
PRESUMED_SWAPPED_COORDINATE
PRESUMED_NEGATED_LONGITUDE
PRESUMED_NEGATED_LATITUDE

Those all look good to exclude, except for COUNTRY_COORDINATE_MISMATCH, which can happen when the center of a grid is outside Belgium, e.g. https://www.gbif.org/occurrence/874553125

We will therefore exclude after the download:

ZERO_COORDINATE
COORDINATE_OUT_OF_RANGE
COORDINATE_INVALID
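
Excluding these three issues after the download could look like the sketch below. It assumes that the downloaded table has an issue column in which the flags of a record are separated by semicolons; has_excluded_issue is a hypothetical helper and the data are made up.

```r
# Toy data; the issue column is assumed to hold semicolon-separated flags.
occ <- data.frame(
  gbifID = 1:4,
  issue = c("",
            "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84",
            "ZERO_COORDINATE;COUNTRY_COORDINATE_MISMATCH",
            "COUNTRY_COORDINATE_MISMATCH"),
  stringsAsFactors = FALSE
)

# Issues to exclude after the download, as listed above.
issues_to_exclude <- c("ZERO_COORDINATE",
                       "COORDINATE_OUT_OF_RANGE",
                       "COORDINATE_INVALID")

# Hypothetical helper: TRUE if a record carries at least one excluded issue.
has_excluded_issue <- function(issue, excluded) {
  flags <- strsplit(issue, ";", fixed = TRUE)
  vapply(flags, function(x) any(x %in% excluded), logical(1))
}

occ_clean <- occ[!has_excluded_issue(occ$issue, issues_to_exclude), , drop = FALSE]
print(occ_clean)
```

Matching whole flags rather than substrings keeps records that only carry COUNTRY_COORDINATE_MISMATCH, which is the case the text above wants to retain.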

other issue filters

Below is a list of other potential issues; we won't filter on these.

RECORDED_DATE_MISMATCH
RECORDED_DATE_INVALID
RECORDED_DATE_UNLIKELY
TAXON_MATCH_FUZZY
TAXON_MATCH_HIGHERRANK
TAXON_MATCH_NONE
DEPTH_NOT_METRIC
DEPTH_UNLIKELY
DEPTH_MIN_MAX_SWAPPED
DEPTH_NON_NUMERIC
ELEVATION_UNLIKELY
ELEVATION_MIN_MAX_SWAPPED
ELEVATION_NOT_METRIC
ELEVATION_NON_NUMERIC
MODIFIED_DATE_INVALID
MODIFIED_DATE_UNLIKELY
IDENTIFIED_DATE_UNLIKELY
IDENTIFIED_DATE_INVALID
BASIS_OF_RECORD_INVALID
TYPE_STATUS_INVALID
MULTIMEDIA_DATE_INVALID
MULTIMEDIA_URI_INVALID
REFERENCES_URI_INVALID
INTERPRETATION_ERROR
INDIVIDUAL_COUNT_INVALID


damianooldoni commented on August 17, 2024

Done in pipelines src/download.Rmd and src/create_db.Rmd of repo occ-processing.
Issue can be closed.
