
Comments (32)

peterdesmet avatar peterdesmet commented on August 17, 2024 1

The record in question is this one: https://www.gbif.org/occurrence/2631775528 (Natuurpunt:Waarnemingen:190863847).

Both Natagora and Natuurpunt have the field identificationVerificationStatus and both (see e.g. https://www.gbif.org/occurrence/2270408500) are publishing unverified records to GBIF (which is fine).

Since other datasets do not have this field, the only option I see is removing records that are explicitly marked as unverified, i.e.:

identificationVerificationStatus = "unverified"

from indicators.
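A minimal sketch of this rule in Python, assuming each occurrence is a dict keyed by Darwin Core term names. The sample records and the helper name `drop_unverified` are hypothetical. Records with an empty or missing status are kept, since most datasets do not publish the field.

```python
def drop_unverified(records):
    """Keep every record except those explicitly marked "unverified"."""
    return [
        r for r in records
        if r.get("identificationVerificationStatus", "") != "unverified"
    ]


# Hypothetical sample records, for illustration only.
sample = [
    {"occurrenceID": "a", "identificationVerificationStatus": "unverified"},
    {"occurrenceID": "b", "identificationVerificationStatus": "verified"},
    {"occurrenceID": "c"},  # dataset that does not publish the field
]

kept = drop_unverified(sample)
kept_ids = [r["occurrenceID"] for r in kept]
print(kept_ids)
```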

timadriaens avatar timadriaens commented on August 17, 2024 1

Let's do this properly as it could have a big effect. Can we have a quick preview here of the number of records per identificationVerificationStatus and per classis for instance ? It might be that there are other relevant categories (and the verification types were adapted along the way after discussions with admins). There is also a category "pending" for example.
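Such a preview could be sketched as follows, assuming the occurrences are available as dicts with the Darwin Core fields `identificationVerificationStatus` and `class`. The helper name and sample records are hypothetical, not taken from the actual download.

```python
from collections import Counter


def count_by_status_and_class(records):
    """Count records per (status, class) pair, largest groups first."""
    counts = Counter(
        (r.get("identificationVerificationStatus", ""), r.get("class", ""))
        for r in records
    )
    return counts.most_common()


# Hypothetical sample records, for illustration only.
sample = [
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "pending", "class": "Insecta"},
]

preview = count_by_status_and_class(sample)
for (status, cls), n in preview:
    print(status, cls, n)
```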


damianooldoni avatar damianooldoni commented on August 17, 2024 1

@timadriaens: since this seems important, I will not wait until I make a new cube to check it. I will try to find some time tomorrow or next week to tackle this.


peterdesmet avatar peterdesmet commented on August 17, 2024 1

If we want to exclude records that are marked as unvalidated (I'm fine with that), I suggest doing that for all processing (alien cube + all cube) and all datasets. That is clearer to explain.


SoVDH avatar SoVDH commented on August 17, 2024

This is of utmost importance! For the Walloon data, this is a large part of Max's work. He made a big effort to convince the experts to validate the datasets before publication. We also chose to validate the data from some experts ourselves, for taxonomic groups in which they have very good expertise. That's part of the reason why it took so long. It seems that Natagora followed the same process as Natuurpunt.
I confirm what Tim just said above: only validated data can be used to run the indicators, to identify emerging species, and to run the models for risk mapping. I know that we potentially 'lose' a lot of data, but here quality MUST take precedence over quantity! I also include @amyjsdavis and @DiederikStrubbe here, as this discussion is relevant for them too.


amyjsdavis avatar amyjsdavis commented on August 17, 2024


damianooldoni avatar damianooldoni commented on August 17, 2024

The GBIF download used as the starting point for the occurrence cube published on Zenodo contains 2447 distinct values of identificationVerificationStatus. They are shown below in descending order of number of occurrences. As you can see, there are a lot of unverified occurrences: 6.652.040, almost 19% of the data. The filtering based on issue (coordinate issues) and occurrenceStatus (absences) removes "just" 165.653 occurrences. So even if all of those were unverified, the number of unverified occurrences would remain very high.

identificationVerificationStatus n
"" 15901818
"unverified" 6652040
"approved on knowledge rules" 6471644
"approved on expert judgement" 3598234
"approved on photographic evidence" 1273927
"verified" 748793
"Validated on the basis of rules" 60848
"Verified Observation" 33048
"validated by PAULY A" 27731
"validated by RASMONT P" 22352
"approved on photographic evide" 18643
"validated by LECLERCQ J" 18533
"Validated without evidence (additional information provided, ...)" 15668
"validated by D'Haeseleer J." 12881
"validated without a document in support (expertise or additional informations)" 10925
"validated by REMACLE A" 8211
... ...


qgroom avatar qgroom commented on August 17, 2024

Interesting!
It is not often appreciated that very common species don't need verifying, because even if the identification was wrong, there is a very good chance that that species is present within a grid cell anyway.
On the other hand, for rare species the numbers of false identifications can far exceed the number of correct identifications.
Therefore, you can happily accept the "unverified" records for common species, but where do you put the cut off?
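To make the asymmetry concrete, here is a small worked example in Python. All numbers are invented for illustration, and the model is a deliberate simplification: it assumes a fixed fraction of records of one common species gets mislabelled as one rare species.

```python
def false_share(n_common, n_rare_true, misid_rate):
    """Share of records labelled as the rare species that are really
    misidentified records of the common species (simplified model)."""
    false_records = n_common * misid_rate
    return false_records / (false_records + n_rare_true)


# Hypothetical numbers: 100,000 records of a common species, 100 genuine
# records of a rare look-alike, and 1% of the common records mislabelled.
share = false_share(n_common=100_000, n_rare_true=100, misid_rate=0.01)
print(f"{share:.0%} of the rare-species records would be false")
```

Under these assumed numbers the 1,000 mislabelled records outnumber the 100 genuine ones ten to one, while the common species loses only 1% of its records.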


damianooldoni avatar damianooldoni commented on August 17, 2024

This was just a relatively fast check.
I will investigate further by:

  1. searching for the datasets the unverified obs come from. Waarnemingen (Natuurpunt data) for sure, maybe other ones?
  2. grouping them by class as asked by @timadriaens
  3. grouping them by year (maybe most of them are "too" recent data from the current year? Then the impact on our analysis is very limited)

Stay tuned 📻


timadriaens avatar timadriaens commented on August 17, 2024

Ok, but I think removing records that are explicitly marked as unverified is indeed fine.


damianooldoni avatar damianooldoni commented on August 17, 2024

As promised, a little more insight about the 6.652.040 unverified data in our GBIF download (date of download: 28 Jan 2020) containing occurrences in BE.

Datasets

Around 77% of the unverified data come from Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium. Almost all of the datasets are "Natuurpunt" related data. One comes from Wallonia: Observations.be - Non-native species occurrences in Wallonia, Belgium. There is also an INBO dataset: Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium.

title n datasetKey
Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium 5137863 e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28
Waarnemingen.be - Plant occurrences in Flanders and the Brussels Capital Region, Belgium 442505 bfc6fe18-77c7-4ede-a555-9207d60d1d86
Waarnemingen.be - Butterfly occurrences in Flanders and the Brussels Capital Region, Belgium 328363 1f968e89-ca96-4065-91a5-4858e736b5aa
Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium 281091 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
Waarnemingen.be - Hymenoptera occurrences in Flanders and the Brussels Capital Region, Belgium 168301 71cfd412-6327-4ec7-8035-d8b2d0509ac5
Waarnemingen.be - Orthoptera occurrences in Flanders and the Brussels Capital Region, Belgium 99233 958b1d2f-2d11-4e94-a828-c8e2d2c013ca
Waarnemingen.be - Non-native plant occurrences in Flanders and the Brussels Capital Region, Belgium 61194 7f5e4129-0717-428e-876a-464fbd5d9a47
Observations.be - Non-native species occurrences in Wallonia, Belgium 44387 629befd5-fb45-4365-95c4-d07e72479b37
Waarnemingen.be - Hemiptera occurrences in Flanders and the Brussels Capital Region, Belgium 43826 37e094f3-dcf2-469f-93a2-c4b9b5fa7275
Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium 20478 7888f666-f59e-4534-8478-3a10a3bfee45
Waarnemingen.be - Fish occurrences in Flanders and the Brussels Capital Region, Belgium 13963 8124cd73-ac84-43d2-ab39-1d80dc346525
Waarnemingen.be - Other insect occurrences in Flanders and the Brussels Capital Region, Belgium 10836 27e9e069-2862-4183-bcec-1e1a7f74d3e7

Classes

Below is the distribution of unverified occurrences at class level, ordered by n, the number of occurrences. An empty class value means occurrences of taxa that don't belong to any class.

class               kingdom         n
Aves                Animalia        5371281
Insecta             Animalia        684639
Magnoliopsida       Plantae         355435
Liliopsida          Plantae         136518
Mammalia            Animalia        51388
Actinopterygii      Animalia        17016
Polypodiopsida      Plantae         9383
(empty)             Plantae         4767
Pinopsida           Plantae         4084
Reptilia            Animalia        3711
Amphibia            Animalia        2973
Gastropoda          Animalia        2732
Bryopsida           Plantae         2033
Bivalvia            Animalia        1401
Malacostraca        Animalia        1211
(empty)             Animalia        1201
Elasmobranchii      Animalia        626
Jungermanniopsida   Plantae         530
Leotiomycetes       Fungi           293
(empty)             incertae sedis  173
Phaeophyceae        Chromista       136
Maxillopoda         Animalia        94
Tentaculata         Animalia        89
Lycopodiopsida      Plantae         82
Ascidiacea          Animalia        58
Arachnida           Animalia        47
Cephalaspidomorphi  Animalia        30
Hydrozoa            Animalia        20
Polychaeta          Animalia        19
Ginkgoopsida        Plantae         15
Florideophyceae     Plantae         13
Demospongiae        Animalia        10
Gymnolaemata        Animalia        9
Chilopoda           Animalia        6
Anthozoa            Animalia        5
Agaricomycetes      Fungi           4
Phylactolaemata     Animalia        4
Cephalopoda         Animalia        2
Clitellata          Animalia        1
Leptocardii         Animalia        1

Years

Distribution over the years, shown in a plot (from 1980) and in a table where years are listed in descending order of number of occurrences, n. As the GBIF download was triggered on 2020-01-28, it contains no 2020 data from waarnemingen.be, which is updated monthly, and so no unverified occurrences for 2020. There are also far fewer unverified data for 2019, due to a typical publishing delay that is longer than 28 days. Both are expected.

[Plot: number of unverified occurrences per year, from 1980]

year n
2018 934689
2017 720012
2016 663146
2015 574543
2014 508016
2013 467497
2012 437100
2011 432699
2010 422923
2009 347533
2008 162985
2007 81114
2005 65306
2006 64999
2004 53171
2019 51730
1996 43995
2003 39278
2002 38855
1999 38227
1997 36425
2000 36116
1998 35821
1995 35511
2001 34515
1994 32038
1993 30831
1992 25246
1991 23535
1987 20743
1984 20421
1986 19869
1990 16730
1985 16152
1988 15827
1981 13706
1989 13307
1982 10714
1983 10649
1980 8561
1979 6545
1978 5408
1974 4790
1975 4563
1973 4532
1976 3675
1972 3459
1977 3453
1971 1746
1968 957
1969 728
1958 648
1959 534
1967 504
1970 482
1966 476
1964 452
1962 420
1965 387
1963 371
1961 363
1960 347
1957 345
1956 311
1955 306
1954 128
1919 123
1948 108
... < 100

I hope this first analysis gives you all more elements to discuss. I would remove these data. I think that data quality is as important as transparency in science.


timadriaens avatar timadriaens commented on August 17, 2024

Indeed @damianooldoni this is as expected. obs.be/wnm.be have a well established validation flow and therefore have that field identificationVerificationStatus filled. Data from the vlinderdatabank are high quality atlas data and not very relevant to TrIAS (unless for the cube used for survey effort correction but I guess for that we don't need to exclude unverified as it is all about the effort) since they contain no non-native species. The distribution looks like it follows the same trend as the total number of observations.

@damianooldoni @peterdesmet @qgroom @SoVDH @amyjsdavis @DiederikStrubbe we exclude the unverified records from the occurrence-based indicators. But do we keep all records to build the cube, assuming that even an unverified record represents survey effort? Or how do we deal with this?


SoVDH avatar SoVDH commented on August 17, 2024

At the risk of sounding like an extremist, I'd rule them out.


damianooldoni avatar damianooldoni commented on August 17, 2024

I agree with @SoVDH for two reasons:

  1. a minimum of data quality (= validation) is extremely important, no matter the goal the data are used for
  2. the indicators are built upon the cube, where data are already aggregated per year, taxon and grid cell. Making a distinction between verified and unverified data means adding an extra column validation (TRUE or FALSE) to keep things tidy. It would make the occurrence cube harder to understand, and I don't think it's worth it.
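The second point can be sketched in Python: if unverified records are dropped before aggregation, the cube keeps its plain (year, taxon, grid cell) key and needs no validation column. This is only an illustration. The field names `year`, `taxonKey` and `cell`, the `EXCLUDED` set and the sample records are assumptions, not the actual cube code.

```python
from collections import Counter

# Statuses to drop before aggregating (assumed for this sketch).
EXCLUDED = {"unverified"}


def build_cube(records):
    """Aggregate occurrences per (year, taxonKey, cell), skipping
    records whose verification status is in EXCLUDED."""
    cube = Counter()
    for r in records:
        if r.get("identificationVerificationStatus", "") in EXCLUDED:
            continue
        cube[(r["year"], r["taxonKey"], r["cell"])] += 1
    return cube


# Hypothetical sample records, for illustration only.
sample = [
    {"year": 2018, "taxonKey": 1, "cell": "A", "identificationVerificationStatus": "verified"},
    {"year": 2018, "taxonKey": 1, "cell": "A"},
    {"year": 2018, "taxonKey": 1, "cell": "A", "identificationVerificationStatus": "unverified"},
    {"year": 2017, "taxonKey": 1, "cell": "A"},
]

cube = build_cube(sample)
for key, n in sorted(cube.items()):
    print(key, n)
```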


amyjsdavis avatar amyjsdavis commented on August 17, 2024


timadriaens avatar timadriaens commented on August 17, 2024

ok of course, but not sure @amyjsdavis will like it?


amyjsdavis avatar amyjsdavis commented on August 17, 2024

@DiederikStrubbe and I discussed this and now I have a better understanding. I am ok with you excluding the unverified data and I don't think this will substantially change the risk models.


amyjsdavis avatar amyjsdavis commented on August 17, 2024

@SoVDH : I have 17 out of 19 plant species SDM models for the risk assessment completed. These of course include the "unvalidated" or "unverified" label. Is it your preference that I run them again with these data excluded or do you want the maps now? It will take a few days, but it can be done.


timadriaens avatar timadriaens commented on August 17, 2024

Yes, I think that would be better.


damianooldoni avatar damianooldoni commented on August 17, 2024

@amyjsdavis: I thought you were using the cube for Europe I made for your SDM (eu_modellingtaxa_cube.csv, metadata: eu_modellingtaxa_info.csv). In this cube there is no way to exclude unverified taxa, so I wonder which occurrence data you are using.

By the way, I will try to make a new version of the cubes before end of June.


amyjsdavis avatar amyjsdavis commented on August 17, 2024

@damianooldoni: These are a different set of species (the plant species on the Species selection for PRA-ing list on Google Drive). They are different from the modellingtaxa list, and thus there is no European cube for them. I realized that in the future there will likely not be a cube for every species to be evaluated, so my modelling flow has the option to use the cube as input if one already exists, or to process data downloaded directly from GBIF. As you may recall, I have to download global data for each species anyway for the models.


timadriaens avatar timadriaens commented on August 17, 2024

@amyjsdavis it's good to keep the options open, but there is a cube for every species on the unified and in fact on every spp.


timadriaens avatar timadriaens commented on August 17, 2024

Are your Belgian maps just crops of an EU/global risk map, or how does that work?


amyjsdavis avatar amyjsdavis commented on August 17, 2024

@timadriaens : indeed, there is a cube for every species, but only for Belgium. The risk maps for Belgium are essentially a crop of a European risk model.


timadriaens avatar timadriaens commented on August 17, 2024

Are we planning to do anything with the European maps? There is certainly interest, cf. Crassula helmsii and Muntiacus reevesi.


damianooldoni avatar damianooldoni commented on August 17, 2024

I would stop this interesting discussion here as it has nothing to do with verification anymore. I started a new one here: trias-project/occ-cube-alien#25


amyjsdavis avatar amyjsdavis commented on August 17, 2024

I have also seen "unvalidated" as a value for identificationVerificationStatus. Should those data also be excluded?


peterdesmet avatar peterdesmet commented on August 17, 2024

@amyjsdavis That was the term we were discussing or am I missing something?


peterdesmet avatar peterdesmet commented on August 17, 2024

Oh, you mean “unvalidated” in addition to “unverified”. Yes, those should ideally be removed as well. Which dataset did you find those in?


amyjsdavis avatar amyjsdavis commented on August 17, 2024

Yes, I found it in my global download for the plants for the risk assessment. I just happened to notice it for Symphyotrichum lanceolatum. The dataset provider is urn:lsid:swedishlifewatch.se:DataProvider:1, the dataset name is Artportalen (Swedish Species Observation System).


amyjsdavis avatar amyjsdavis commented on August 17, 2024

My global download dataset is here: https://doi.org/10.15468/dl.ruaasw


damianooldoni avatar damianooldoni commented on August 17, 2024

This issue can be closed as we filter out unvalidated data, see https://github.com/trias-project/occ-cube/blob/master/src/2_create_db.Rmd#L252-L260 and https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L288-L296 for the names of the issue whose occurrences we filter out.
