trias-project / unified-checklist

🇧🇪 Global Register of Introduced and Invasive Species - Belgium

Home Page: https://trias-project.github.io/unified-checklist/

License: MIT License

checklist r dataset gbif rstats oscibio invasive-species

unified-checklist's Introduction

Global Register of Introduced and Invasive Species - Belgium

Rationale

This repository contains the functionality to create and standardize the Global Register of Introduced and Invasive Species - Belgium to a Darwin Core checklist that can be harvested by GBIF.

This unified checklist is the result of two open and reproducible data pipelines developed for the TrIAS project (http://trias-project.be). In the data publication pipeline, we use the Checklist recipe to standardize and publish a selection of authoritative species checklists as Darwin Core Archives to GBIF. Most of these checklists record the presence of alien species in Belgium for a specific taxon group or habitat and are maintained by their respective authors. In the data processing pipeline, we extract all Belgian non-native taxa from these checklists and unify their taxonomy using the GBIF Backbone Taxonomy. This automated process is implemented and documented at https://trias-project.github.io/unified-checklist/

The sources used for the unified checklist are:

  1. Manual of the Alien Plants of Belgium (Verloove et al. 2018)
  2. Checklist of alien birds of Belgium (Preda et al. 2019)
  3. Checklist of non-native freshwater fishes in Flanders, Belgium (Verreycken et al. 2018)
  4. Checklist of alien herpetofauna of Belgium (van Doorn et al. 2021)
  5. Inventory of alien macroinvertebrates in Flanders, Belgium (Boets et al. 2018)
  6. Registry of introduced terrestrial molluscs in Belgium (Backeljau et al. 2019)
  7. Checklist of alien species in the Scheldt estuary in Flanders, Belgium (Soors et al. 2021)
  8. Catalogue of the Rust Fungi of Belgium (Vanderweyen et al. 2018)
  9. WRiMS: World Register of Introduced Marine Species (Rius et al. 2023)
  10. Waarnemingen.be / observations.be - List of species observed in Belgium (Swinnen et al. 2022)
  11. Ad hoc checklist of alien species in Belgium (Reyserhove et al. 2018)
  12. RINSE - Pathways and vectors of biological invasions in Northwest Europe (Zieritz et al. 2018)

Workflow

See https://trias-project.github.io/unified-checklist/

Published dataset

Repo structure

The repository structure is based on Cookiecutter Data Science and the Checklist recipe. Files and directories indicated with GENERATED should not be edited manually.

├── README.md              : Description of this repository
├── LICENSE                : Repository license
├── unified-checklist.Rproj : RStudio project file
├── .gitignore             : Files and directories to be ignored by git
│
├── data
│   ├── raw                : Source data as downloaded from GBIF GENERATED
│   ├── interim            : Unified data GENERATED
│   └── processed          : Darwin Core output of mapping script GENERATED
│
├── references
│   └── verification.tsv   : Verification file (for synonyms). Generated by
│                            3_verify_taxa.Rmd and then manually annotated
│
├── docs                   : Repository website GENERATED
│
├── index.Rmd              : Website homepage
├── _bookdown.yml          : Settings to build website in docs/
│
└── src
    ├── 1_get_taxa.Rmd     : Script to get taxa from checklists
    ├── 2_get_information.Rmd : Script to get related information
    ├── 3_verify_taxa.Rmd  : Script to verify taxa
    ├── 4_unify_taxa.Rmd   : Script to unify taxa
    ├── 5_unify_information.Rmd : Script to unify related information
    ├── 6_dwc_mapping.Rmd  : Script to map to Darwin Core
    └── 7_griis_mapping.Rmd : Script to create Excel file for GRIIS

Installation

  1. Clone this repository to your computer
  2. Open the RStudio project file
  3. Open the index.Rmd R Markdown file in RStudio
  4. Install any required packages
  5. Click Build > Build Book to generate the processed data and build the website in docs/

Publication

To publish an update of the dataset:

  1. Open the resource in the IPT (login required)
  2. Source data: upload the newly generated data files from data/processed
  3. Darwin Core mappings: does not require updates, unless terms were added/removed in the pipeline
  4. Metadata: does not require updates, except for:
    • Basic metadata: in description, check if number of taxa (2.500+) still applies
    • Taxonomic coverage: in description, update numbers per kingdom based on new data
    • Temporal coverage: update End date if need be
  5. Publish: click Publish, add a short description and publish
  6. Check if dataset is updated at GBIF (can take a couple of hours)

Contributors

List of contributors

License

MIT License for the code and documentation in this repository. The included data is released under another license.

unified-checklist's People

Contributors

damianooldoni, lienreyserhove, peterdesmet, pietrh, timadriaens

Forkers

timadriaens

unified-checklist's Issues

Taxa with extensions information missing in taxon core

I found three taxa in speciesProfiles which are missing in the taxon core.

> speciesProfiles %>% anti_join(taxon, by = "id")
anti_join: added no columns
           > rows only in x       3
           > rows only in y  (   37)
           > matched rows    (2,999)
           >                 =======
           > rows total           3
# A tibble: 3 x 7
  id        is_marine is_freshwater is_terrestrial is_invasive habitat  source       
  <chr>     <lgl>     <lgl>         <lgl>          <lgl>       <chr>    <chr>        
1 https://~ FALSE     TRUE          TRUE           NA          freshwa~ https://www.~
2 https://~ FALSE     TRUE          TRUE           NA          freshwa~ https://www.~
3 https://~ FALSE     TRUE          TRUE           NA          freshwa~ https://www.~

id of these three taxa:

I think the origin of this bug is in this section of the workflow:
https://trias-project.github.io/unified-checklist/4_unify_taxa.html#explicitely-remove-incorrect-taxa

We remove these taxa from the taxon core but not from the extension.
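One possible fix, sketched here in base R on toy data: after taxa are removed from the taxon core, keep only the extension rows whose `id` still occurs in the core. The data frames, ids and the helper name `keep_core_ids()` are made up for illustration; in the dplyr-based pipeline, `semi_join()` would do the same job.

```r
# Toy taxon core and extension (ids are invented for illustration)
taxon <- data.frame(id = c("t1", "t2"), stringsAsFactors = FALSE)
species_profiles <- data.frame(
  id = c("t1", "t2", "t3"),
  is_freshwater = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only extension rows whose id is in the core
keep_core_ids <- function(extension, core) {
  extension[extension$id %in% core$id, , drop = FALSE]
}

species_profiles <- keep_core_ids(species_profiles, taxon)

# No orphaned extension rows should remain after filtering
orphans <- setdiff(species_profiles$id, taxon$id)
```

Applying the same filter to every extension right after the removal step would keep core and extensions in sync.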

Annoying message in unify_information.Rmd

Running code block unify_information-7 in unify_information.Rmd returns the following message:

Joining, by = "taxonKey"
no non-missing arguments, returning NA
no non-missing arguments

and this message is repeated over and over and over and over and over and over and over again. Perhaps a good idea to suppress the message?
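A minimal sketch of how the output could be silenced for one call. `noisy_step()` is a hypothetical stand-in for the code in that block, and it is an assumption that the first line is a message and the repeated line a warning.

```r
# Hypothetical stand-in for the noisy code block:
# emits one message and one warning, then returns a value
noisy_step <- function() {
  message('Joining, by = "taxonKey"')
  warning("no non-missing arguments, returning NA")
  42
}

# Silence both messages and warnings for this call only
result <- suppressWarnings(suppressMessages(noisy_step()))
```

In an R Markdown file, setting the chunk options `message = FALSE` and `warning = FALSE` hides the same output for a whole chunk without touching the code.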

Render output with bookdown

Mainly because:

  • We want to be able to run several Rmd in series
  • We don't want all the site assets bundled up in the html
  • We want to output to /docs without the (function(input_file, encoding) { rmarkdown::render(input_file, encoding = encoding, output_file = paste0("../docs/",sub(".Rmd", ".html", basename(input_file))))}) hack

Established Union List species lacking from GRIIS

I just wanted to flag that Persicaria wallichii seems to be lacking from the GRIIS checklist in Belgium, whereas it is quite well established and has many records on wnm.be. This might be due to the many synonyms (Koenigia polystachya, ) and the mapping of Manual of Alien Plants onto the GRIIS unified.

In the manual it is under Rubrivena polystachya which has gbif code 4037343

In waarnemingen.be it is under Persicaria wallichii, gbif code 6391908

@peterdesmet @LienReyserhove can you take a look please and make sure it is added to GRIIS Belgium (preferably under 8848208)?

Is the column `issues` still a list?

Tidylog reports no changes for this step:

6. Convert the column `issues` from a list to a concatenated string.
```{r get_taxa-10}
taxa <-
taxa %>%
mutate(issues = sapply(issues, toString)) %>%
mutate(issues = na_if(issues, "NA")) # Set "NA" strings to real NA
```

Maybe this step can be removed?
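To check before removing it, here is a small base R sketch of the same conversion with a guard on the column type. The helper name `collapse_issues()` is hypothetical, and treating empty strings as `NA` is an assumption that goes slightly beyond the original step.

```r
# Collapse a list column to strings; leave other column types untouched
collapse_issues <- function(issues) {
  if (!is.list(issues)) {
    return(issues) # Already atomic: nothing to convert
  }
  out <- vapply(issues, toString, character(1))
  out[out %in% c("", "NA")] <- NA_character_ # Empty or "NA" strings become real NA
  out
}

as_list <- list(c("A", "B"), character(0), NA) # issues as a list column
as_chr <- c("A, B", NA)                        # issues already a character column

from_list <- collapse_issues(as_list)
from_chr <- collapse_issues(as_chr)
```

If `issues` already arrives as a character column, the function returns it unchanged, which would be consistent with tidylog reporting no changes.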

Function verify_synonyms()

verify_synonyms <- function(
    taxa = NULL, # Dataframe with taxa to verify
    verified_synonyms = NULL # Dataframe with verified synonym info
)

Parameters

taxa: a dataframe with at least the following columns

  • backbone_taxonKey
  • backbone_scientificName
  • backbone_acceptedKey
  • backbone_accepted

verified_synonyms: a dataframe with at least the following columns

  • backbone_taxonKey
  • backbone_scientificName
  • backbone_acceptedKey
  • backbone_accepted
  • backbone_kingdom: to be populated from GBIF (is not in taxa)
  • date_added: to be populated by function
  • verified_key: to be populated manually by expert. Not required by this function, but any other functionality will use this key so it is good to check its existence.
  • remarks: NO CHECK, IS NOT REQUIRED

Functionality (pseudo code):

if in verified_synonyms:
    if taxa.backbone_scientificName != verified_synonyms.backbone_scientificName:
        update in verified_synonyms
        add to updated_scientificName
    if taxa.backbone_accepted != verified_synonyms.backbone_accepted:
        update in verified_synonyms
        add to updated_accepted
    else
        do nothing
else (not in verified_synonyms):
    add to verified_synonyms
    add to new_synonyms

if in verified_synonyms, but not in taxa:
    add to unused_synonyms

Return

  • verified_synonyms: same as input df, but now with updated info. Could be written to file outside the function.
  • new_synonyms: a subset of verified_synonyms (same columns) with synonyms relations that were added (found in taxa, but not in verified_synonyms)
  • unused_synonyms: a subset of verified_synonyms (same columns) with unused synonym relations (found in verified_synonyms, but not in taxa)
  • updated_scientificName: a df with backbone_scientificName + updated_backbone_scientificName
  • updated_accepted: a df with backbone_accepted + updated_backbone_accepted
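The pseudo code and return values above could be sketched in base R as follows. This is one possible reading of the spec, not the final implementation: matching on `backbone_taxonKey` is an assumption, `backbone_kingdom` is left `NA` instead of being fetched from GBIF, and the toy data is invented.

```r
verify_synonyms <- function(taxa, verified_synonyms) {
  required <- c("backbone_taxonKey", "backbone_scientificName",
                "backbone_acceptedKey", "backbone_accepted")
  stopifnot(all(required %in% names(taxa)),
            all(required %in% names(verified_synonyms)))

  # Row of taxa matching each verified synonym (NA when no longer in taxa)
  idx <- match(verified_synonyms$backbone_taxonKey, taxa$backbone_taxonKey)
  found <- !is.na(idx)
  unused_synonyms <- verified_synonyms[!found, , drop = FALSE]

  # Overwrite names that changed in taxa and record the old/new pairs
  updated <- list()
  for (col in c("backbone_scientificName", "backbone_accepted")) {
    old <- verified_synonyms[[col]]
    new <- old
    new[found] <- taxa[[col]][idx[found]]
    changed <- !is.na(new) & (is.na(old) | old != new)
    verified_synonyms[[col]] <- new
    pairs <- data.frame(old[changed], new[changed], stringsAsFactors = FALSE)
    names(pairs) <- c(col, paste0("updated_", col))
    updated[[col]] <- pairs
  }

  # Taxa not yet listed: append them with empty expert columns
  is_new <- !(taxa$backbone_taxonKey %in% verified_synonyms$backbone_taxonKey)
  new_synonyms <- taxa[is_new, required, drop = FALSE]
  for (col in setdiff(names(verified_synonyms), required)) {
    new_synonyms[[col]] <- rep(NA, nrow(new_synonyms))
  }
  if ("date_added" %in% names(new_synonyms)) {
    new_synonyms$date_added <- rep(as.character(Sys.Date()), nrow(new_synonyms))
  }
  new_synonyms <- new_synonyms[, names(verified_synonyms), drop = FALSE]
  verified_synonyms <- rbind(verified_synonyms, new_synonyms)
  rownames(verified_synonyms) <- NULL

  list(verified_synonyms = verified_synonyms,
       new_synonyms = new_synonyms,
       unused_synonyms = unused_synonyms,
       updated_scientificName = updated$backbone_scientificName,
       updated_accepted = updated$backbone_accepted)
}

# Toy input (all keys and names are invented)
taxa <- data.frame(
  backbone_taxonKey = c(1, 2, 3),
  backbone_scientificName = c("A a", "B b new", "C c"),
  backbone_acceptedKey = c(10, 20, 30),
  backbone_accepted = c("A acc", "B acc", "C acc"),
  stringsAsFactors = FALSE
)
verified <- data.frame(
  backbone_taxonKey = c(1, 2, 9),
  backbone_scientificName = c("A a", "B b", "Z z"),
  backbone_acceptedKey = c(10, 20, 90),
  backbone_accepted = c("A acc", "B acc", "Z acc"),
  backbone_kingdom = "Plantae",
  date_added = "2018-01-01",
  verified_key = c(10, 20, 90),
  remarks = NA_character_,
  stringsAsFactors = FALSE
)
res <- verify_synonyms(taxa, verified)
```

With this toy input, key 3 ends up in `new_synonyms`, key 9 in `unused_synonyms`, and the renamed key 2 in `updated_scientificName`.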

Documentation

  • Document all above as succinct as possible with roxygen

Standardize degree of establishment

For the indicators (see inbo/reporting-rshiny-grofwildjacht#148 (comment)), but also for any user, it would be good if the values for degree of establishment are standardized. There are two questions to answer:

1. What values to choose

Currently we use:

  • blackburn_et_al_2011:B3 in the Mollusca checklist
  • released (blackburn_2011:B3) in the birds checklist

See table 2 in https://doi.org/10.3897/biss.3.38084: the correct value to use would be the label, so just released

2. Where do we standardize

The quickest (and temporary) way to standardize is in the indicators, but it would be better to standardize here in the unified checklist, or in the source checklists. As far as I can tell, there are only two: mollusca and birds, so it seems best to update them there:
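Wherever the fix lands, the recoding itself is small. A sketch in base R, assuming only the two patterns listed above occur; the function name and the `labels` lookup are hypothetical, and only the B3 = released pair is taken from this issue.

```r
# Reduce degree of establishment values to the plain label, e.g. "released".
# Assumption: only the two patterns seen so far need handling.
standardize_establishment <- function(x, labels = c(B3 = "released")) {
  # "released (blackburn_2011:B3)" -> "released"
  out <- trimws(sub("[[:space:]]*\\(.*\\)[[:space:]]*$", "", x))
  # "blackburn_et_al_2011:B3" -> "B3"
  code <- sub("^blackburn_et_al_2011:", "", out)
  is_code <- !is.na(out) & code != out & code %in% names(labels)
  out[is_code] <- unname(labels[code[is_code]])
  out
}

x <- c("blackburn_et_al_2011:B3", "released (blackburn_2011:B3)", "released")
standardized <- standardize_establishment(x)
```

Values that match neither pattern pass through unchanged, so new codes would need to be added to `labels` explicitly.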

Name changes to consider

Should be considered for change, now it's confusing.
To do after first publication:

  • SourceChecklist vs datasetKey
  • rename checklists.csv --> e.g. to checklists summary?

Format for sources

taxonRemarks for taxa

Lists the datasets considered for unifying information:

Sources considered for unifying information: https://DOI, https://DOI, https://DOI

Note: this is not the source for the taxon information (that would be GBIF Backbone taxonomy), but just extra information (hence why we put it in taxonRemarks)

Todo:

  • dwc_mapping.Rmd: add correct taxonRemarks format to taxa

Source for distribution, species profile, description

Includes the link to the specific checklist taxon that was used for this piece of information:

https://www.gbif.org/species/141264581: Nymphaea marliacea Marliac in Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018). Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise. Checklist dataset https://doi.org/10.15468/wtda1m

Todo:

  • get_taxa.Rmd: Remove "accessed via GBIF.org on xxxx-xx-xx." from citation and add as citation to checklists.csv
  • unify_information.Rmd: add sourceScientificName to distribution
  • unify_information.Rmd: add sourceScientificName to species profile
  • unify_information.Rmd: add sourceScientificName to description
  • dwc_mapping.Rmd: create small build_source_citation() function, taking the arguments taxonKey, scientificName, datasetCitation
  • dwc_mapping.Rmd: add correct source format to distribution
  • dwc_mapping.Rmd: add correct source format to species profile
  • dwc_mapping.Rmd: add correct source format to description
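A minimal sketch of the helpers these todos describe. `build_source_citation()` follows the arguments named above; `strip_accessed()` and `build_taxon_remarks()` are hypothetical names, the exact wording of the "accessed via" suffix is an assumption, and the access date in the example is invented.

```r
# Source string for distribution, species profile and description
build_source_citation <- function(taxonKey, scientificName, datasetCitation) {
  paste0("https://www.gbif.org/species/", taxonKey, ": ",
         scientificName, " in ", datasetCitation)
}

# Drop a trailing "accessed via GBIF.org on yyyy-mm-dd." (pattern is an assumption)
strip_accessed <- function(citation) {
  pattern <- "[[:space:]]*accessed via GBIF\\.org on [0-9]{4}-[0-9]{2}-[0-9]{2}\\.?[[:space:]]*$"
  trimws(sub(pattern, "", citation))
}

# taxonRemarks string listing the datasets that were considered
build_taxon_remarks <- function(dois) {
  paste0("Sources considered for unifying information: ",
         paste(dois, collapse = ", "))
}

raw_citation <- paste(
  "Verloove F, Groom Q, Brosens D, Desmet P, Reyserhove L (2018).",
  "Manual of the Alien Plants of Belgium. Version 1.7. Botanic Garden Meise.",
  "Checklist dataset https://doi.org/10.15468/wtda1m",
  "accessed via GBIF.org on 2018-11-05." # Invented date, for illustration only
)
src <- build_source_citation(
  taxonKey = "141264581",
  scientificName = "Nymphaea marliacea Marliac",
  datasetCitation = strip_accessed(raw_citation)
)
remarks <- build_taxon_remarks("https://doi.org/10.15468/wtda1m")
```

Here `src` reproduces the example source string given earlier in this issue.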

Species native in one Belgian region, but not in another

A problem (already raised in #32) arising from including regional distributions in the unified checklist (see #45 and #43) is that eventually, species that have been introduced in one region but are native in another, are included in the checklist of alien species of Belgium. I think it would be best to not include those in the unified? Some examples are

  • Podarcis muralis is there because it is non-native to Flanders (introduced with habitat material probably and still spreading along the railway network).
  • Natrix helvetica which was introduced in Flanders and Brussels but is native in Wallonia (and of conservation concern everywhere)

@damianooldoni @peterdesmet We should decide what to do with such cases, because it is strange they appear on a Belgian alien species checklist.

synonyms not linked in gbif backbone

Hi, there is a problem when downloading Orconectes limosus (our commonest North American crayfish) from gbif. The taxonomy changed and the accepted name is now Faxonius limosus.

Both are accepted names on gbif currently, but they are not linked. However, it is the same species (and it is on the union list...). How to solve this? @peterdesmet suggested adding both accepted names to the checklist? Should we also update the tsv with the Union List species?

Function get_taxa()

get_taxa <- function(
    taxon_keys = NULL,
    checklist_keys = NULL,
    limit = NULL
)

Parameters

If parameter taxon_keys:

  • Allows single taxon key (an integer)
  • Allows vector of taxon keys (vector of integers)
  • Functionality: use function name_usage() to query e.g. http://api.gbif.org/v1/species/5231190
  • If used together with checklist_keys: assertion error

If parameter checklist_keys:

  • Works with single dataset key (a string)
  • Works with vector of datasets keys (vector of strings)
  • Functionality: use function name_usage(datasetKey = ...)
  • If used together with taxon_keys: assertion error

If parameter limit (e.g. 10)

  • With taxon_keys: limit to 10 taxa
  • With checklist_keys: limit to 10 taxa PER DATASET

Return

  • A dataframe with all returned attributes for any taxa:
    key        nubKey   nameKey   taxonID       sourceTaxonKey  kingdom  ...
    134087746  5567657  15913604  2346          NA              Plantae  ...
    5567657    5567657  1439808   gbif:5567657  123978664       Plantae  ...
  • Columns can be in any order, but key (= input taxonKey) should be first.
  • Don't return metadata such as limit, offset

Warning

  • If taxonKey not found by name_usage(): store in vector. At end of retrieving all keys, add those to the returned dataframe, with column key populated and all other columns NA. Also provide warning message, listing the keys that were not found at GBIF. Note: this should never happen with parameter checklist_keys. 😄
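The argument checks, the limit and the not-found handling could be sketched as below. To keep the example runnable offline, the GBIF calls are replaced by two hypothetical callbacks, `fetch_taxon` and `fetch_checklist`, which are not part of the spec; in the real function they would wrap `name_usage()`. The toy backbone is invented.

```r
get_taxa <- function(taxon_keys = NULL, checklist_keys = NULL, limit = NULL,
                     fetch_taxon = NULL, fetch_checklist = NULL) {
  # taxon_keys and checklist_keys are mutually exclusive
  stopifnot(xor(is.null(taxon_keys), is.null(checklist_keys)))

  if (!is.null(checklist_keys)) {
    # limit applies per dataset
    parts <- lapply(checklist_keys, fetch_checklist, limit = limit)
    taxa <- do.call(rbind, parts)
  } else {
    if (!is.null(limit)) taxon_keys <- head(taxon_keys, limit)
    parts <- lapply(taxon_keys, fetch_taxon) # NULL means "not found"
    not_found <- taxon_keys[vapply(parts, is.null, logical(1))]
    hits <- Filter(Negate(is.null), parts)
    taxa <- if (length(hits) > 0) {
      do.call(rbind, hits)
    } else {
      data.frame(key = taxon_keys[0])
    }
    if (length(not_found) > 0) {
      warning("Keys not found at GBIF: ", paste(not_found, collapse = ", "))
      # One all-NA row per missing key, with only key populated
      missing_rows <- taxa[rep(NA_integer_, length(not_found)), , drop = FALSE]
      missing_rows$key <- not_found
      taxa <- rbind(taxa, missing_rows)
    }
  }
  rownames(taxa) <- NULL
  taxa[, c("key", setdiff(names(taxa), "key")), drop = FALSE] # key first
}

# Toy stand-in for GBIF (invented records)
toy_backbone <- data.frame(kingdom = c("Plantae", "Animalia"),
                           key = c(1L, 2L), stringsAsFactors = FALSE)
fetch_taxon <- function(key) {
  hit <- toy_backbone[toy_backbone$key == key, , drop = FALSE]
  if (nrow(hit) == 0) NULL else hit
}

# Key 99 is not in the toy backbone, so this call warns and adds an NA row
taxa <- get_taxa(taxon_keys = c(2L, 99L, 1L), fetch_taxon = fetch_taxon)
```

Passing the fetchers as arguments is purely a device for this sketch; it also makes the warning path easy to test without network access.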

This function should replace:

  • checklist_import()
  • checklists_import()

Documentation

  • Document all above as succinct as possible with roxygen

Modify structure verification_file.csv

File verification_file.csv will be used by taxonomists for verifying taxa. At the moment, it contains the following columns in the following order:

  • scientificName
  • bb_scientificName
  • bb_taxonomicStatus
  • bb_acceptedName
  • bb_key
  • bb_acceptedKey
  • bb_kingdom
  • issues
  • verification_key
  • date_added
  • checklists
  • remarks

This structure is optimal for experts but too difficult to manage. Main source of bugs is the combination of the following properties:

  1. Mapping by names.
  2. Multiple checklists (comma separated) for same scientificName allowed

Below are the columns of taxa.csv. I checked the box beside the columns I think we need to include in verification_file.csv:

  • taxonKey
  • scientificName
  • taxonID
  • datasetKey
  • nameType
  • issues
  • validDistribution
  • bb_key
  • bb_scientificName
  • bb_species
  • bb_genus
  • bb_family
  • bb_order
  • bb_class
  • bb_phylum
  • bb_kingdom
  • bb_rank
  • bb_speciesKey
  • bb_taxonomicStatus
  • bb_acceptedKey
  • bb_acceptedName

In addition to these columns, the following ones are specific to verification_file.csv and should be present:

  • verificationKey
  • dateAdded
  • remarks

Based on what we decide in this issue I will modify (= simplify) trias::verify_taxa().
@peterdesmet What do you think?

Input requested for some record-level terms

For some of the record-level terms, I'm not 100% sure what information to use:

  1. license: Taxa info comes from backbone (CC-BY), rest from datasets, might not always be CC0. I would use the most limiting license, i.e. CC-BY

  2. rightsHolder:

Organization who has the rights to the data and in the case of multiple rights holder, the organization who managed/made the decision to release those rights under CC0. Is often the same as publishing organization.

This is a difficult one, I'm tempted to say that the owners of the checklists are the rightsHolders, but then we're ignoring the taxonomic information from the backbone...

  3. institutionCode: now populated with "INBO", but according to our guidelines, this should be the same as rightsHolder

Restrict types of descriptions

The unified checklist currently contains:

type                                 number of records
degree of establishment              68
Introduced species abundance         102
Introduced species impact            195
Introduced species management        4
Introduced species population trend  25
Introduced species remark            90
Introduced species vector dispersal  359
invasion stage                       2,469
native range                         4,109
pathway                              3,153

@qgroom @timadriaens @SoVDH Should we restrict those to the ones we know about (i.e. those in lower case)? The others are from WRiMS.
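If the answer is yes, the restriction could be as simple as the base R sketch below. It relies on the observation above that the known types are written in lower case; the toy data frame holds a few of the listed types.

```r
# Toy descriptions with a mix of known and WRiMS types
descriptions <- data.frame(
  type = c("pathway", "native range", "Introduced species impact",
           "invasion stage"),
  stringsAsFactors = FALSE
)

# Keep only description types written entirely in lower case
is_lower <- descriptions$type == tolower(descriptions$type)
restricted <- descriptions[is_lower, , drop = FALSE]
```

An explicit allow-list of the four lower-case types would be more robust than the case test, should a source ever change its capitalization.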

Remove `data/output`

The directories data/interim and data/output both contain a verification file + taxa after verification. This should be cleaned up.

taxa_after_verification.csv makes sense in data/interim
verification_file.csv should probably be moved to data/raw, as it is a start file.

Carex pendula incorrectly on list, but not in manual

Originally reported in trias-project/alien-birds-checklist#13 by @timadriaens:

Hi, scrolling through some emerging species products further down the pipeline and through the unified checklist itself, there are still a number of species that should not occur on the unified checklist. I use this issue to report them:

  • Carex pendula (Hangende zegge is a native species, the reference given in the unified is https://doi.org/10.15468/wtda1m which is the MAP, but this species is not in the MAP)

"Natural" is not a level 1 CBD pathway category

Upon updating the biodiversity indicators for Flanders with @SanderDevisscher, we noticed that "natural" has become a category at level 1. As a result, the indicator table on pathways gives separate species counts for unaided and for natural dispersal, which is wrong (they are synonymous). See the vocab table for pathways: line 52 is in fact redundant, because if level 1 is unaided then level 2 is always natural dispersal.

In fact, "unaided" (level 1) is the same as "natural dispersal" (level 2).
So probably something happens in the function which results in this.

Write Rmd to get taxa from checklists

  1. define checklists
  2. get taxa
  3. filter on origin = source
  4. write to file: checklist_taxa.tsv
  5. filter on nubkey
  6. filter on distribution
  7. get backbone info (see this commit for inspiration)
  8. write to another file: alien_backbone_taxa.tsv
  9. publish as webpage

Fields to include in DwC

Fields I found in GRIIS checklists are in bold.

taxon core

  • id: GBIF species key
  • modified: publication date of contributing checklist
  • language
  • license
  • rightsHolder: see #30
  • accessRights
  • bibliographicCitation: citation of contributing checklist = GBIF backbone
  • datasetID: DOI of unified checklist
  • institutionCode: not part of core
  • datasetName: name of unified checklist
  • taxonID: according to GBIF
  • scientificName: according to GBIF
  • acceptedNameUsageID: according to GBIF; will be identical to taxonID for all except SYNONYMS; all considered accepted
  • acceptedNameUsage: according to GBIF; will be identical to scientificName for all except SYNONYMS; all considered accepted
  • kingdom: according to GBIF
  • phylum: according to GBIF
  • class: according to GBIF
  • order: according to GBIF
  • family: according to GBIF
  • genus: according to GBIF
  • taxonRank: according to GBIF
  • nomenclaturalCode
  • taxonomicStatus: according to GBIF; most will be ACCEPTED; not set
  • taxonRemarks: remark for SYNONYMs and why they are kept; no: sources that were considered

distribution extension

  • id
  • locationID: "ISO_3166-2:BE"
  • locality: "Belgium"
  • countryCode: "BE"
  • occurrenceStatus: "present"
  • establishmentMeans: "introduced"
  • eventDate: widest range within used checklist
  • source: citation of contributing checklist OR actual source within that checklist

species profile

  • id: one species profile per taxon, from the most trustworthy checklist
  • isMarine
  • isFreshwater
  • isTerrestrial
  • isInvasive: we do not have this information
  • habitat: e.g. "terrestrial|freshwater", is same information as is... fields, so easy to repeat here

`progress_estimated()` is deprecated

Warning: progress_estimated() was deprecated in dplyr 1.0.0.

Progress bars are still shown though, but I assume a separate package should be loaded?

Expected output of verify_taxa.Rmd

File named taxa_after_verification.csv

  • key: same as in taxa.csv
  • scientificName: same as in taxa.csv
  • taxonID: same as in taxa.csv
  • datasetKey: same as in taxa.csv
  • nameType: same as in taxa.csv
  • issues: same as in taxa.csv
  • validDistribution: same as in taxa.csv
  • verifiedKey: same as in taxa.csv
  • bb_key = trias_verifiedKey
  • bb_scientificName: regarding trias_verifiedKey
  • bb_species: regarding trias_verifiedKey
  • bb_genus: regarding trias_verifiedKey
  • bb_family: regarding trias_verifiedKey
  • bb_order: regarding trias_verifiedKey
  • bb_class: regarding trias_verifiedKey
  • bb_phylum: regarding trias_verifiedKey
  • bb_kingdom: regarding trias_verifiedKey
  • bb_rank: regarding trias_verifiedKey
  • bb_speciesKey: regarding trias_verifiedKey
  • bb_taxonomicStatus: regarding trias_verifiedKey
  • bb_acceptedKey: regarding trias_verifiedKey
  • bb_acceptedName: regarding trias_verifiedKey

Note that trias_verifiedKey should only contain a single value. Taxa that are verified to multiple taxa should thus appear as multiple lines
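A sketch of that expansion in base R. It assumes that a taxon verified to several taxa stores its keys as one comma-separated string in `verifiedKey`; the function name and the toy data are hypothetical.

```r
# Expand rows whose verification key lists several keys into one row per key.
# Assumption: multiple keys are stored as a comma-separated string.
expand_verified <- function(taxa, key_col = "verifiedKey") {
  keys <- strsplit(as.character(taxa[[key_col]]), ",", fixed = TRUE)
  n <- lengths(keys)
  out <- taxa[rep(seq_len(nrow(taxa)), n), , drop = FALSE] # Repeat each row n times
  out[[key_col]] <- trimws(unlist(keys))
  rownames(out) <- NULL
  out
}

taxa <- data.frame(
  key = c(1L, 2L),
  verifiedKey = c("100", "200, 300"),
  stringsAsFactors = FALSE
)
expanded <- expand_verified(taxa)
```

After this step each row carries a single key, so the bb_ columns can be looked up per row.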

How to get match ISSUES from GBIF

I noticed that the issues we collect in get_taxa.Rmd are issues with the backbone taxon, not issues of the checklist taxa:

We get the ORIGINAL_NAME_DERIVED in our issues, which isn't useful for us. Unfortunately, the lookup function doesn't seem to return issues as a column:

alien_plants <- rgbif::name_lookup(
  datasetKey = "9ff7d317-609b-4c08-bd86-3bc404b77c42",
  origin = "source",
  limit = 99999,
  return = "data"
)
colnames(alien_plants)
 [1] "key"                 "scientificName"     
 [3] "datasetKey"          "nubKey"             
 [5] "parentKey"           "parent"             
 [7] "kingdom"             "family"             
 [9] "kingdomKey"          "familyKey"          
[11] "canonicalName"       "nameType"           
[13] "taxonomicStatus"     "origin"             
[15] "numDescendants"      "numOccurrences"     
[17] "taxonID"             "habitats"           
[19] "nomenclaturalStatus" "threatStatuses"     
[21] "synonym"             "species"            
[23] "speciesKey"          "rank"               
[25] "genus"               "genusKey"           

New habitat information

Examine what habitat information is requested by GRIIS:

For e.g. all Fungi, Viruses, Hemiptera are listed as 'Host'. For Marine and Fish we are using WoRMS and Fishbase to assign the habitat type.
Would you have any objection to post the list with the reassigned habitat. We can highlight the changes for you.

Some thoughts about the website

While strolling through the unified-checklist website, I'm writing down some suggestions. Just some things that pop up in my mind, feel free to integrate or not.

  1. I wonder what the intended public is for this website. If it is intended for the broad (scientific) public with limited knowledge, I would add some information in certain parts: e.g.:
  • taxonKey can be used to verify manually on GBIF (section 2.3): I would add the URL (i.e. www.gbif.org/species/taxonKey). Maybe also refer to the verbatim page, as the overview page for e.g. Pilosella x brachiata indicates that the species is present in Belgium, but not that the presence of this species is uncertain
  2. Section 3: I would add a preview of each of the resulting datasets

  3. Section 6.3 Unify distributions

  • What is the reason to use startYear as endYear when no endYear is provided in the unified checklist? (in the first step) I think we need to think about a general strategy for dates because:

    • alien plants: when only one date is available, eventDate is either start year or end year (in the case of the alien plants, start year in the unified checklist could just be end year as well)
    • alien fishes: when one date is available: use the year of the publication of the checklist as end year
  • sometimes, we have no date information at all. Should also be mentioned in the text somewhere (i.e. what about NA's)

  • In the same section, I wonder if it would be possible to give more previews as well, sometimes it's hard to visualise the steps we undertake. A preview could be helpful here.

Species lacking regional distribution

Some species are currently lacking the Flemish region locality on the "include-distribution-regions" branch. These include:

Species               Key
Alopochen aegyptiaca  2498252
Oxyura jamaicensis    2498305
Ludwigia grandiflora  5421039
Ludwigia peploides    5420991

According to Tim the missing birds can be explained by the fact the birds checklist is not yet published. This also explains the absence of Sacred ibis from the unified checklist.

Malva sylvestris (cultivar)

Malva sylvestris (cultivar), often called mauritiana, is in the non-native plants dataset of waarnemingen.be (however with the same speciesKey as the native Malva sylvestris). It is a bit strange of course to find a native species key in the non-native plants dataset, but it is understandable and relates to taxonomic issues/interpretation (it is not in MAP and MAP says that it hardly deserves a taxonomic status). However, the Dutch do consider it, see https://www.verspreidingsatlas.nl/6891 (if you click taxonomy you can directly look for it on gbif, handy!).

The thing is that now, we can't evaluate emerging status of this "seed mixture escape" since that is produced only for species on the unified. Just want to flag this, and discuss about potential solutions.

Verify new taxa

415 taxa are currently not verified. 421 are from the new run on 2023-09-15 (which includes the waarnemingen.be checklist). We should make an effort to verify the suggested synonymy where we can. See spreadsheet.

  • Previously listed synonyms for which the scientific name was updated in the backbone ok
  • Previously listed synonyms for which the accepted scientific name was updated in the backbone ok

comparison with Belgian DAISIE dataset

I would like to start this issue as a form of quality control of the Belgian GRIIS checklist, after some vain attempts to discuss this at overcrowded TrIAS core group meetings and some discussions with @SanderDevisscher, where we lacked the time to address it. After all, this is also one of the main reasons for having published DAISIE within the AlienCSI STSM. The procedure is relatively simple I think; the big work will be in finding the appropriate source to complement GRIIS Belgium:

  • select Belgian species from DAISIE checklist, as locality = Belgium
  • do an outer join with GRIIS Belgium based on common field (taxonkey) to see which species GRIIS Belgium does not have and DAISIE does and vice versa
  • in-depth quality control: the fields occurrenceStatus, establishmentMeans and eventDate can be used to explore differences with GRIIS Belgium and perhaps to improve it, provided we have sources. We will also get an idea of mistakes in both registers
  • region_of_first_record is also interesting
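The first two steps could be sketched in base R as follows. The toy data frames and the flag columns `in_daisie` and `in_griis` are made up for illustration, and it is an assumption that both datasets share a `taxonKey` column.

```r
# Toy extracts (invented keys); real data would come from the two checklists
daisie <- data.frame(
  taxonKey = c(1L, 2L, 3L),
  locality = c("Belgium", "Belgium", "France"),
  stringsAsFactors = FALSE
)
griis <- data.frame(taxonKey = c(2L, 4L))

# Step 1: select Belgian species from DAISIE
daisie_be <- daisie[daisie$locality == "Belgium", , drop = FALSE]

# Step 2: outer join on taxonKey, flagging where each taxon occurs
daisie_be$in_daisie <- TRUE
griis$in_griis <- TRUE
comparison <- merge(daisie_be, griis, by = "taxonKey", all = TRUE)
comparison$in_daisie[is.na(comparison$in_daisie)] <- FALSE
comparison$in_griis[is.na(comparison$in_griis)] <- FALSE

only_daisie <- comparison$taxonKey[comparison$in_daisie & !comparison$in_griis]
only_griis <- comparison$taxonKey[comparison$in_griis & !comparison$in_daisie]
```

The rows flagged in both datasets are the ones where occurrenceStatus, establishmentMeans and dates can then be compared field by field.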

Standardization native range values

After reviewing the pipeline, the next (final) step will be standardization of vocabularies. While working on producing a unified checklist of alien species of Belgium I noticed that native range assumes values at a variety of levels (country level, continental level, climate level, origin level). While reading data from the following six checklists:

  1. Manual of the Alien Plants of Belgium
  2. Inventory of alien macroinvertebrates in Flanders, Belgium
  3. Checklist of non-native freshwater fishes in Flanders, Belgium
  4. Catalogue of the Rust Fungi of Belgium
  5. RINSE - Registry of non-native species in the Two Seas region countries (Great Britain, France, Belgium and the Netherlands)
  6. RINSE - Pathways and vectors of biological invasions in Northwest Europe

I get the following values:
Africa, Africa (WGSRPD:2), Arctic, Asia, Australasia (WGSRPD:5), Australia, Australia (WGSRPD:50), China, cultivated origin, East Asia, Eastern Europe, Europe (WGSRPD:1), hybrid origin, Indo-Pacific, New Zealand, North Africa, Northeast Asia, Northern America, Northern America (WGSRPD:7), pan-American, Pantropical, Ponto-Caspian, South America, Southeast Asia, Southern America (WGSRPD:8), Southern Europe, Southern Hemisphere, temperate Asia (WGSRPD:3), Tropical and warm seas, tropical Asia (WGSRPD:4), United States, West Africa, Western Atlantic.

@peterdesmet: WGSRPD has been conceived specifically for plant distribution. Are there good practice guidelines for distribution of species belonging to other kingdoms? Does a unique controlled vocabulary exist for all kingdoms? Any other idea about standardization? I don't immediately see a solution. Thanks in advance.

remove native birds from unified checklist

While checking the birds extract from the unified, there appear to be a number of species native to Belgium on the unified checklist:

Species                                   Argumentation
Perdix perdix (Linnaeus, 1758)            native red list species but indeed restocking occurs regularly for hunting
Dryocopus martius (Linnaeus, 1758)        native black woodpecker
Bubulcus ibis (Linnaeus, 1758)            vagrant, more and more seen, also kept in aviaries
Nycticorax nycticorax (Linnaeus, 1758)    native breeding heron species, but also kept in aviaries
Ciconia nigra (Linnaeus, 1758)            native black stork, rare breeder
Emberiza hortulana Linnaeus, 1758         probably also kept in aviaries but would take it out, rare vagrant
Tarsiger cyanurus (Pallas, 1773)          rare vagrant
Athene noctua (Scopoli, 1769)             native little owl
Anser fabalis (Latham, 1787)              winter migrant, possibly also bred in waterfowl collections
Anser anser (Linnaeus, 1758)              mixed population of wild and escaped birds but native to Belgium
Milvus migrans (Boddaert, 1783)           native
Bubo bubo (Linnaeus, 1758)                native, breeding, but kept widely in collections
Branta leucopsis (Bechstein, 1803)        native, breeding, wintering, mixed population of wild and birds of escaped origin

these would imo best be taken out of the unified for now @LienReyserhove .

I think we should reference ISO_3166 not ISO_3166-2

We reference ISO_3166-2 for our locationIDs:

https://github.com/trias-project/alien-plants-belgium/blob/a62c4a9d5beaa2e421b51fed50bce697eabdf613/src/dwc_mapping.Rmd#L467-L473

But ISO_3166-2 is only for the subdivisions of countries. So correct for BE-WAL, but not for BE.

I propose to just reference the general ISO_3166 (e.g. ISO_3166:BE-WAL, ISO_3166:BE), rather than one of its parts. That broader namespace is not going to create name clashes either (i.e. any code in ISO_3166-2 will not appear in ISO_3166-1 or 3).
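A minimal sketch of what the proposal would look like in the mapping, assuming the location codes sit in a character vector. The helper names `as_location_id()` and `is_subdivision()` are hypothetical and not part of the existing mapping script.

```r
# Hypothetical helper: prefix any ISO 3166 code with the general namespace,
# whether it is a country code (3166-1) or a subdivision code (3166-2).
as_location_id <- function(code) {
  paste0("ISO_3166:", code)
}

# Hypothetical helper: subdivision codes are a two-letter country code,
# a hyphen and up to three alphanumeric characters, so they cannot be
# confused with the bare two-letter country codes.
is_subdivision <- function(code) {
  grepl("^[A-Z]{2}-[A-Z0-9]{1,3}$", code)
}

codes <- c("BE", "BE-WAL", "BE-VLG", "BE-BRU")
location_id <- as_location_id(codes)

location_id
is_subdivision(codes)
```

With this, both the country and its regions get the same prefix, and the part of the standard can still be derived from the code itself if it is ever needed.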

@qgroom @LienReyserhove Suggestions?

Humulus/Humulopsis scandens/japonicus lacking from unified

Upon checking the country level status for the EASIN baseline distribution of Union Concern species for the 3rd batch species, I noticed Humulus scandens (aka Humulus japonicus) does not occur on the unified (whereas it should, as it is in the Manual of Alien Plants).

There is something strange going on. I can't find the species in the taxon.txt from the GBIF download of the unified, and GBIF does not list it for GRIIS Belgium either (not under Humulus scandens (Lour.) Merr., nor under the synonym Humulus japonicus Sieb. & Zucc., nor under Humulopsis scandens (Lour.) Grudz.).

However, GBIF does present description data on Humulus japonicus Sieb. & Zucc. coming from the Manual.

NATIVE RANGE
Asia
source: Manual of the Alien Plants of Belgium

INVASION STAGE
casual
source: Manual of the Alien Plants of Belgium

PATHWAY
cbd_2014_pathway:escape_horticulture
source: Manual of the Alien Plants of Belgium

How is this at all possible @peterdesmet @qgroom ? And how to add the species to the Belgian checklist? Does this have to do with verification again? This is problematic for Union List IAS for which we have to officially report. As good Belgians we should at least try to get our hops right :-)

Missing prefix "cbd_2014_pathway:" in pathway info of two datasets

I found that the pathway info linked to types "introduction pathway" (Checklist of alien herpetofauna of Belgium) and "pathway of introduction" (Checklist of alien species in the Scheldt estuary in Flanders, Belgium) discussed in #68 is not standardized as it doesn't start with the typical cbd_2014_pathway: prefix. @peterdesmet: I suppose this should be improved at checklist level as for #68, right?

Here is a table of the values I found in description:

suspect_type_pathways <- c(
  "introduction pathway",
  "pathway of introduction"
)

description %>% 
  filter(type %in% suspect_type_pathways) %>% 
  distinct(type, description) %>%
  arrange(type, description)
# A tibble: 19 x 2
   type                    description                  
   <chr>                   <chr>                        
 1 introduction pathway    contaminant                  
 2 introduction pathway    contaminant_timber           
 3 introduction pathway    escape_pet                   
 4 introduction pathway    escape_research              
 5 introduction pathway    nursery                      
 6 introduction pathway    release_conservation         
 7 introduction pathway    release_landscape_improvement
 8 introduction pathway    release_other                
 9 introduction pathway    stowaway_container           
10 introduction pathway    stowaway_other               
11 pathway of introduction contaminant_plant            
12 pathway of introduction corridor_water               
13 pathway of introduction escape_aquaculture           
14 pathway of introduction escape_pet                   
15 pathway of introduction release_fishery              
16 pathway of introduction stowaway                     
17 pathway of introduction stowaway_ballast_water       
18 pathway of introduction stowaway_hull_fouling        
19 pathway of introduction stowaway_ship                
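If we wanted to patch this at unified level in the meantime, the prefix could be added along these lines. This is only a sketch on toy rows that mimic `description`; `description_fixed` and `prefix` are names made up here. It does not check whether each value (e.g. `nursery` or `stowaway`) is actually a valid term of the CBD vocabulary.

```r
library(dplyr)

# Toy rows mimicking the `description` data frame shown above
description <- tibble(
  type = c("introduction pathway", "pathway of introduction", "pathway"),
  description = c(
    "escape_pet",
    "stowaway_ballast_water",
    "cbd_2014_pathway:escape_horticulture"
  )
)

suspect_type_pathways <- c(
  "introduction pathway",
  "pathway of introduction"
)
prefix <- "cbd_2014_pathway:"

# Add the prefix only to rows of the two suspect types that lack it
description_fixed <- description %>%
  mutate(
    description = if_else(
      type %in% suspect_type_pathways & !startsWith(description, prefix),
      paste0(prefix, description),
      description
    )
  )

description_fixed
```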

Analysis of the infraspecific taxa

@timadriaens found it strange that there was no analysis of Trachemys scripta, although this species is on the list of species of Union concern. The answer was simple: the unified contains three subspecies of it, but not the species itself (see www.gbif.org/species/163636890), although it is an alien species.

Why are there infraspecific taxa in the unified and source checklists? There are two possible reasons, so we can divide such taxa into two groups:

  1. there are both alien and native subspecies present in Belgium
  2. their native ranges, pathways, habitats, distributions or year of introduction are different

The subspecies of Trachemys scripta belong to the second group, as the three subspecies have different dates of first observation in the ad-hoc checklist (see raw dump). But the species itself is alien...
So, how to distinguish infraspecific taxa of group 1 from those of group 2? One idea is to add the species to the taxon core without extensions, so that no information about pathways, native range, distribution etc. is attached to it, since this is specified at the infraspecific level.

This solution will not affect checklist indicators at all as they are based on the information in the extensions. For occurrence indicators we should add a step in the pipeline for making the cube to avoid doubling the number of occurrences, but it is feasible.

In this file, infraspecific_alien_taxa_source_info.txt, you can find a list of all infraspecific taxa in the unified with their source checklist, where key is the key of the taxon in the unified (e.g. 152543132) and nubKey is the key of the corresponding taxon in the GBIF backbone (e.g. 6157050).
@timadriaens and I think it is worth finding them. There should not be that many.

@peterdesmet, @qgroom & co.: what do you think about it? Is it something to add at unified level? Or at source checklist level?
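To see which species would have to be added to the taxon core, something like the following could work. It is a rough sketch on a toy table: the column names `rank` and `speciesKey` (backbone key of the parent species) are assumptions about how the taxa would be stored, and all keys below are made up.

```r
library(dplyr)

# Toy taxon core: keys are invented for illustration only
taxa <- tibble(
  nubKey = c(101L, 102L, 103L, 200L, 201L),
  speciesKey = c(100L, 100L, 100L, 200L, 200L),
  rank = c("SUBSPECIES", "SUBSPECIES", "SUBSPECIES", "SPECIES", "SUBSPECIES")
)

# Backbone keys of the species that are already in the core
species_in_core <- taxa %>%
  filter(rank == "SPECIES") %>%
  pull(nubKey)

# Parent species of infraspecific taxa that are not in the core themselves
missing_parents <- taxa %>%
  filter(rank %in% c("SUBSPECIES", "VARIETY")) %>%
  filter(!speciesKey %in% species_in_core) %>%
  distinct(speciesKey)

missing_parents
```

The resulting list still has to be checked by hand, since for group 1 taxa the species itself should not be added as alien.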

How to add more WRIMS distributions

WRIMS distributions need a combination of:

"country": "BE",
"status": "PRESENT",
"establishmentMeans": "INTRODUCED"

... to be selected. Ideally they have temporal information:

"temporal": "2000",

I notice many distributions only have part of the information: e.g. year, but not status. Search for example for "BE" in https://api.gbif.org/v1/species/157131005/distributions: there are 8 distributions for Belgium, but none have all 3 properties and year.

I wonder if we should drop the status field in our selection.
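To get a feeling for the effect of dropping the status field, the two selections can be compared. This is a sketch on toy rows, assuming the distributions have already been read into a data frame with the field names shown above; the values are invented.

```r
library(dplyr)

# Toy distributions using the field names of the GBIF API response
distributions <- tibble(
  country = c("BE", "BE", "BE", "NL"),
  status = c("PRESENT", NA, NA, "PRESENT"),
  establishmentMeans = c("INTRODUCED", "INTRODUCED", NA, "INTRODUCED"),
  temporal = c("2000", "1998", "2005", NA)
)

# Current selection: all three properties are required
strict <- distributions %>%
  filter(
    country == "BE",
    status == "PRESENT",
    establishmentMeans == "INTRODUCED"
  )

# Relaxed selection: status is no longer required
relaxed <- distributions %>%
  filter(
    country == "BE",
    establishmentMeans == "INTRODUCED"
  )

c(strict = nrow(strict), relaxed = nrow(relaxed))
```

Note that `filter()` drops rows where a condition evaluates to `NA`, so a missing status excludes a distribution from the strict selection.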

Documentation for index.Rmd

The use of index.Rmd as the first file to run in RStudio (containing the packages etc.) should be documented in the README.

Pathway information in different types

While trying to update the indicators and helping @timadriaens make some graphs, I found the following:

description %>% distinct(type)
distinct: removed 12,232 rows (>99%), 12 rows remaining
# A tibble: 12 x 1
   type                               
   <chr>                              
 1 pathway                            
 2 degree of establishment            
 3 native range                       
 4 Introduced species vector dispersal
 5 Introduced species impact          
 6 Introduced species abundance       
 7 introduction pathway               
 8 Introduced species remark          
 9 Introduced species management      
10 pathway of introduction            
11 Introduced species population      
12 Introduced species population trend

There are 3(!) types for encoding pathway information, where type pathway is the most used (and the correct one):

type_pathways <- c(
  "pathway",
  "introduction pathway",
  "pathway of introduction"
)
description %>% 
  filter(type %in% type_pathways) %>% 
  group_by(type) %>% 
  count() %>%
  arrange(desc(n))
# A tibble: 3 x 2
# Groups:   type [3]
  type                        n
  <chr>                   <int>
1 pathway                  3283
2 introduction pathway      163
3 pathway of introduction    61

Description data with type introduction pathway

All these data come from the Checklist of alien herpetofauna of Belgium (https://doi.org/10.15468/pnxu4c):

description %>%
  filter(type %in% "introduction pathway") %>%
  group_by(type) %>%
  distinct(source) %>%
  mutate(from_herpetofauna = str_detect(.data$source,
                                        pattern  = "herpetofauna",
                                        negate = FALSE)) %>%
  group_by(from_herpetofauna) %>%
  count()
from_herpetofauna     n
  <lgl>             <int>
1 TRUE                 89

Description data with type pathway of introduction

All these data come from Checklist of alien species in the Scheldt estuary in Flanders, Belgium (https://doi.org/10.15468/8zq9s4):

description %>%
  filter(type %in% "pathway of introduction") %>%
  group_by(type) %>%
  distinct(source) %>% 
  mutate(from_scheldt_estuary = str_detect(.data$source,
                                        pattern  = "Scheldt estuary",
                                        negate = FALSE)) %>%
  group_by(from_scheldt_estuary) %>%
  count()
from_scheldt_estuary     n
  <lgl>                <int>
1 TRUE                    54

I think these are issues to solve at checklist level. @peterdesmet: what do you think?
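Until the source checklists are updated, the three types could be folded into one before calculating the indicators. A minimal sketch on toy rows; `description_harmonized` is a name made up here and this step is not part of the current pipeline.

```r
library(dplyr)

type_pathways <- c(
  "pathway",
  "introduction pathway",
  "pathway of introduction"
)

# Toy rows mimicking the `description` data frame
description <- tibble(
  type = c(
    "pathway",
    "introduction pathway",
    "pathway of introduction",
    "native range"
  ),
  description = c(
    "cbd_2014_pathway:escape_pet",
    "escape_pet",
    "stowaway",
    "Asia"
  )
)

# Map the two deviating types to "pathway", leave other types untouched
description_harmonized <- description %>%
  mutate(type = if_else(type %in% type_pathways, "pathway", type))

description_harmonized %>%
  count(type, sort = TRUE)
```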

species lacking from unified

I accidentally noticed Crassula helmsii does not occur on the unified. It is one of our most notorious invasive plant species. I see the accepted name (which also occurs in the other GRIIS checklists) is this one; however, when mapping RINSE and the Manual of Alien Plants it was mapped on the synonym.

@peterdesmet @damianooldoni can you check what went wrong here?

quality control GRIIS Belgium - fish

Whilst scanning through the list of species selected for the MijnVismaat occurrences dataset, I noticed a few native species in there. I suspect this has to do with using GRIIS Belgium as a selection filter and indeed, those species appear to be on GRIIS Belgium:

  • Silurus glanis (GBIF key 2337607): although it is true that the current population of this species is the result of (re)introductions, there is historic proof that it is a native species, so it should have degree of establishment nativeReintroduced
  • Leuciscus idus (GBIF key 4409643): simply native

These species do not belong on the alien species checklist for Belgium. They originate from the Zieritz et al. checklist.

Problems with duplicate descriptions due to different vocabularies?

To select unique descriptions in the unified checklist, we apply the following code to select the descriptions across checklists (section 6.5 point 3):

 # Group by type and verificationKey across checklists
  group_by(
    type,
    description,
    verificationKey
  ) %>%
  
  # Select first datasetKey, taxonKey and scientificName
  summarize(
    datasetKey = first(datasetKey),
    taxonKey = first(taxonKey),
    scientificName = first(scientificName)
  ) %>%

By grouping by type, description and verificationKey, we risk selecting duplicate descriptions when checklists use different vocabularies. An example:

verificationKey  type          description       taxonKey
a                native range  Northern America  1
a                native range  Southern America  1
b                native range  North America     1

Here, all descriptions for this species will be selected, due to the use of a different vocabulary.
To be considered....
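One option to consider is harmonizing the vocabulary before selecting unique descriptions. Below is a sketch on the example above, assuming a hand-made lookup (`synonyms`, hypothetical) from deviating terms to the preferred ones. Whether to group per taxon rather than per verificationKey is a separate choice that this sketch simply makes for illustration.

```r
library(dplyr)

# The example from above as a data frame
description <- tibble(
  verificationKey = c("a", "a", "b"),
  type = "native range",
  description = c("Northern America", "Southern America", "North America"),
  taxonKey = 1L
)

# Hypothetical lookup: deviating term = preferred term
synonyms <- c("North America" = "Northern America")

# Replace deviating terms, keep all other descriptions as they are
normalized <- description %>%
  mutate(
    description = coalesce(unname(synonyms[description]), description)
  )

# Number of unique descriptions per taxon before and after harmonizing
n_before <- description %>%
  distinct(taxonKey, type, description) %>%
  nrow()
n_after <- normalized %>%
  distinct(taxonKey, type, description) %>%
  nrow()

c(before = n_before, after = n_after)
```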

strange behavior name_usage(datasetKey = c("key1", "key2"))

I noticed quite strange behavior while using rgbif's function name_usage() with datasetKey equal to a character vector. Try running the code below:

test1 <- name_usage(datasetKey = "73605f3a-af85-4ade-bbc5-522bfb90d847")
test2 <- name_usage(datasetKey = "d7c60346-44b6-400d-ba27-8d3fbeffc8a5")
test3 <- name_usage(datasetKey = c("73605f3a-af85-4ade-bbc5-522bfb90d847",
                                   "d7c60346-44b6-400d-ba27-8d3fbeffc8a5"))
test4 <- name_usage(datasetKey = c("d7c60346-44b6-400d-ba27-8d3fbeffc8a5",
                                   "73605f3a-af85-4ade-bbc5-522bfb90d847")) #invert order datasetKeys
c(nrow(test1$data), nrow(test2$data), nrow(test3$data), nrow(test4$data))
c(unique(test1$data$datasetKey), unique(test2$data$datasetKey), 
  unique(test3$data$datasetKey), unique(test4$data$datasetKey))

I got this:

c(nrow(test1$data), nrow(test2$data), nrow(test3$data), nrow(test4$data))
10  7 10 10
unique(test1$data$datasetKey) 
[1] "73605f3a-af85-4ade-bbc5-522bfb90d847"
unique(test2$data$datasetKey)
[1] "d7c60346-44b6-400d-ba27-8d3fbeffc8a5"
unique(test3$data$datasetKey)
[1] "73605f3a-af85-4ade-bbc5-522bfb90d847"
unique(test4$data$datasetKey)
[1] "73605f3a-af85-4ade-bbc5-522bfb90d847"

I was expecting:

c(nrow(test1$data), nrow(test2$data), nrow(test3$data), nrow(test4$data))
10  7 17 17
unique(test1$data$datasetKey) 
[1] "73605f3a-af85-4ade-bbc5-522bfb90d847"
unique(test2$data$datasetKey)
[1] "d7c60346-44b6-400d-ba27-8d3fbeffc8a5"
unique(test3$data$datasetKey)
[1] "73605f3a-af85-4ade-bbc5-522bfb90d847" "d7c60346-44b6-400d-ba27-8d3fbeffc8a5"
unique(test4$data$datasetKey)
[1] "d7c60346-44b6-400d-ba27-8d3fbeffc8a5" "73605f3a-af85-4ade-bbc5-522bfb90d847"

Strange. What do you think? Should I start an issue on rgbif?
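In the meantime, a possible workaround is to query the keys one by one and bind the results. This is a sketch, not tested against the live API: `name_usage_multi()` is a hypothetical helper, the `fetch` argument exists only so it can be tried offline with a stand-in function, and pagination (`limit`, `start`) is not handled.

```r
library(dplyr)

# Hypothetical helper: call the fetch function once per dataset key and
# stack the `data` element of each result. By default it would use
# rgbif::name_usage(), which is only looked up if `fetch` is not supplied.
name_usage_multi <- function(dataset_keys, fetch = rgbif::name_usage) {
  results <- lapply(dataset_keys, function(key) {
    fetch(datasetKey = key)$data
  })
  bind_rows(results)
}

# Offline check with a stand-in that returns two rows per key
fake_fetch <- function(datasetKey) {
  list(data = tibble(datasetKey = datasetKey, key = seq_len(2)))
}

combined <- name_usage_multi(c("key1", "key2"), fetch = fake_fetch)
nrow(combined)
unique(combined$datasetKey)
```

With the real function, the call would simply be `name_usage_multi(c("73605f3a-af85-4ade-bbc5-522bfb90d847", "d7c60346-44b6-400d-ba27-8d3fbeffc8a5"))`.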

Only include species

The checklist currently contains:

taxonRank    taxa
GENUS          11
SPECIES      2399
SUBSPECIES    101
VARIETY        18

For a unified checklist, I think it makes sense to aggregate this information on SPECIES only, because:

  • This is likely the taxonomic unit used for reporting
  • This is likely the taxonomic unit used for analyses
  • This makes the unified checklist easier to work with (e.g. for querying occurrences)
  • This would be a documented step in the processing
  • The original subspecies ranks would still be present in the original checklists, so it is traceable

This would only affect 5% of the taxa (i.e. the non-SPECIES):

  • Lumping SUBSPECIES and VARIETY under their SPECIES: this can reduce the number of taxa if more than one SUBSPECIES or VARIETY is listed for a SPECIES
  • Dropping GENUS for which no species is listed on a checklist (8 taxa)
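The two steps could look roughly like this. It is a sketch on a toy table, assuming each taxon carries the backbone key of its species in a `speciesKey` column (empty for GENUS); all keys are made up, and the real processing would also need to carry over the extensions.

```r
library(dplyr)

# Toy checklist: keys are invented for illustration only
taxa <- tibble(
  taxonKey = 1:6,
  taxonRank = c(
    "SPECIES", "SUBSPECIES", "SUBSPECIES", "VARIETY", "GENUS", "GENUS"
  ),
  speciesKey = c(10L, 20L, 20L, 10L, NA, NA)
)

species_only <- taxa %>%
  # Drop taxa above species level, which have no speciesKey
  filter(!is.na(speciesKey)) %>%
  # Lump SUBSPECIES and VARIETY under their SPECIES
  distinct(speciesKey)

species_only
```

In this toy case six taxa collapse to two species: one that was already listed as SPECIES (with a VARIETY) and one that was only present through two SUBSPECIES.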

I've created a spreadsheet of the taxa that are affected.

@timadriaens @SoVDH @qgroom would you be OK with this choice?
