Comments (10)

MattBlissett commented on September 21, 2024

Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

See https://data-blog.gbif.org/post/clustering-occurrences/, which describes what we're already doing and references Nicky's work.

from dwc-qa.

ben-norton commented on September 21, 2024

This is a significant issue in camera trapping. Most of the major projects (e.g., eMammal, Snapshot USA, Wildlife Insights) are collaborations between multiple institutions. These are referred to as 'initiatives' since they are larger than one 'organization' or 'institution'. Within those initiatives, providers may submit their own datasets as part of the effort.
Scenario 1
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data in Wildlife Insights. Following the iNaturalist model, Wildlife Insights publishes their data to GBIF. This results in duplicate datasets on GBIF. The Wildlife Insights data will be significantly larger, but that doesn't negate the duplicate issue.
Scenario 2
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data using Wildlife Insights. Wildlife Insights doesn't publish data to GBIF. The researcher and Wildlife Insights would like to connect the dataset to other Wildlife Insights datasets on GBIF. Here, instead of Wildlife Insights publishing in bulk, a collection of datasets is connected and, as a whole, represents the initiative.

Jegelewicz commented on September 21, 2024

Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.

ymgan commented on September 21, 2024

We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I am not sure if this is relevant, but we use the project ID in the EML, assuming of course that one dataset has only one project. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO

For datasets with multiple projects, an issue is opened here: gbif/ipt#1780
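
To make the dataset-level grouping concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `key` and `project_ids` field names, the `ds-1` style dataset keys, and the `GRANT-X` identifier are all hypothetical, and the project ID `BR/154/A1/RECTO` is read off the link above on the assumption that `~2F` encodes `/`. It does not call GBIF; it only shows how datasets declaring the same project ID fall into one group, including the multi-project case.

```python
from collections import defaultdict


def group_by_project(datasets):
    """Map each project ID to the keys of the datasets that declare it.

    Each dataset is a dict with a hypothetical "key" and a list of
    "project_ids" (a list so that multi-project datasets can be shown).
    """
    groups = defaultdict(list)
    for ds in datasets:
        for pid in ds.get("project_ids", []):
            groups[pid].append(ds["key"])
    return dict(groups)


# Hypothetical dataset metadata, not real GBIF records.
datasets = [
    {"key": "ds-1", "project_ids": ["BR/154/A1/RECTO"]},
    {"key": "ds-2", "project_ids": ["BR/154/A1/RECTO", "GRANT-X"]},
    {"key": "ds-3", "project_ids": []},
]

groups = group_by_project(datasets)
for project_id, keys in groups.items():
    print(project_id, keys)
```

A dataset with no project ID (`ds-3` here) simply belongs to no group, and a dataset with two project IDs appears in both groups.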

debpaul commented on September 21, 2024

@ben-norton Scenario 2 is what I'm thinking about (although I'd guess in the situation I'm thinking about, your first scenario is very likely going to happen).

It makes me think hard from a Latimer Core perspective. We're really talking about whole/parts relationships and the many variables around which we might pivot or group data.

So, for a given distributed project, then
a) They would agree to only submit to GBIF once
b) There'd be a field (Grant Number?) around which their data could be grouped.
c) (Or maybe each group in this consortium has a group ID? that goes with the grant number?)

Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

d) Even in that case ^^^ The original project would still like to see / grab through the API / visualize their aggregated data.

Where does this leave us? Are there standards in place to help us do this? Or do we have a gap?

debpaul commented on September 21, 2024

Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.

Gotcha @Jegelewicz although in this case I'm really talking about projects, not really institutions. We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number. Something like this would be useful. When you start to tease it apart, many specimens will be touched / imaged / sampled / etc in connection to different grants. So it's also a one:many thing. We need a way to group the objects around that grant number...

Jegelewicz commented on September 21, 2024

We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I could see this being covered by the Identifier class in LatimerCore.

debpaul commented on September 21, 2024

We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I could see this being covered by the Identifier class in LatimerCore.

Thanks! It's definitely parallel to the idea of pivoting different parts of the same collection in different ways.

debpaul commented on September 21, 2024

We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I am not sure if this is relevant, but we use the project ID in the EML, assuming of course that one dataset has only one project. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO

For datasets with multiple projects, an issue is opened here: gbif/ipt#1780

@ymgan in the scenario I'm describing, various groups across the USA would be collecting data (observations and specimens) independently, each in their own area of the USA. They'd be using a standard protocol. The goal would be for all these distributed sets to come together around a particular data point. Perhaps this Project ID in the EML could do that. Does this sound parallel to what you are describing?

ymgan commented on September 21, 2024

Yes, that is parallel to what I am describing. However, that is at the dataset level. At the record level, datasetName and datasetID do seem to serve this purpose:

@dagendresen made a good remark here: gbif/pipelines#665 (comment)

One important reason or rationale is to group records produced or updated from different project funding. Similar to how the GBIF BID, BIFA, and CESP projects list datasets produced by this project funding. However, often we see project funding for georeferencing, or taxonomic validation and desire to "tag" the data records (or actually ultimately rather desire to "tag" the actual real-world collection specimens) that were georeferenced from a specific project funding --> to credit the funder and track fulfillment of the promise to the funder of e.g. georeferencing 10 000 collection specimens...
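
As a rough illustration of the record-level "tagging" idea, here is a minimal Python sketch that counts records per datasetID, for example to check progress against a promised number of georeferenced specimens. Everything specific in it is an assumption: the record dicts, the `occ-1` style identifiers, the grant-like datasetID values, and the use of `|` to separate several IDs on one record (to reflect the one:many case raised earlier in the thread) are hypothetical and not something this thread prescribes.

```python
from collections import Counter


def count_by_dataset_id(records, sep="|"):
    """Count how many records carry each datasetID tag.

    A record may list several IDs separated by `sep` (an assumed
    convention); each distinct ID is counted once per record.
    """
    counts = Counter()
    for rec in records:
        raw = rec.get("datasetID") or ""
        ids = {part.strip() for part in raw.split(sep) if part.strip()}
        counts.update(ids)
    return counts


# Hypothetical occurrence records, not real data.
records = [
    {"occurrenceID": "occ-1", "datasetID": "georef-grant-2020"},
    {"occurrenceID": "occ-2", "datasetID": "georef-grant-2020 | imaging-grant-2021"},
    {"occurrenceID": "occ-3", "datasetID": ""},
    {"occurrenceID": "occ-4"},
]

counts = count_by_dataset_id(records)
print(counts["georef-grant-2020"])
```

Records with an empty or missing datasetID are skipped, so the count reflects only records explicitly tagged with a given funding source.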
