Question about the use of the dwc:datasetName field. Scenari


dwc:datasetName use to group datasets, yes, no, options please? (dwc-qa issue, open)

debpaul commented on September 21, 2024

dwc:datasetName use to group datasets, yes, no, options please?

from dwc-qa.

Comments (10)

MattBlissett commented on September 21, 2024

> Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

See https://data-blog.gbif.org/post/clustering-occurrences/, which describes what we're already doing and references Nicky's work.
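To make the idea of finding duplicates from the data itself concrete, here is a minimal sketch. It is not GBIF's clustering algorithm (the blog post above describes that); it only assumes that records sharing a scientific name, event date, and rounded coordinates but coming from different datasets are worth flagging as candidates. The function name, the `precision` parameter, and the sample records are all made up for illustration.

```python
from collections import defaultdict


def find_cross_dataset_candidates(records, precision=3):
    """Return groups of records that share a coarse key across 2+ datasets.

    Hypothetical helper: the key (name, date, rounded lat/lon) is an
    illustrative assumption, not the rule set GBIF actually uses.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec["scientificName"].strip().lower(),
            rec["eventDate"],
            round(rec["decimalLatitude"], precision),
            round(rec["decimalLongitude"], precision),
        )
        groups[key].append(rec)
    # Keep only groups whose members come from more than one dataset.
    return [
        group
        for group in groups.values()
        if len({rec["datasetKey"] for rec in group}) > 1
    ]


# Made-up sample data: the same observation published twice, plus one unique record.
records = [
    {"occurrenceID": "A1", "datasetKey": "researcher-dataset",
     "scientificName": "Puma concolor", "eventDate": "2023-05-01",
     "decimalLatitude": 35.1234, "decimalLongitude": -106.5678},
    {"occurrenceID": "B1", "datasetKey": "initiative-dataset",
     "scientificName": "Puma concolor", "eventDate": "2023-05-01",
     "decimalLatitude": 35.1234, "decimalLongitude": -106.5678},
    {"occurrenceID": "A2", "datasetKey": "researcher-dataset",
     "scientificName": "Lynx rufus", "eventDate": "2023-05-02",
     "decimalLatitude": 35.2, "decimalLongitude": -106.6},
]

candidates = find_cross_dataset_candidates(records)
for group in candidates:
    print(sorted(rec["occurrenceID"] for rec in group))
```

Exact-key matching like this misses near-duplicates (for example, slightly different coordinates or name spellings), which is where a real clustering approach would do better.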


ben-norton commented on September 21, 2024

This is a significant issue in camera trapping. Most of the major projects (e.g., eMammal, Snapshot USA, Wildlife Insights) are collaborations between multiple institutions. These are referred to as 'initiatives' since they are larger than one 'organization' or 'institution'. Within those initiatives, providers may submit their own datasets as part of the effort.
Scenario 1
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data in Wildlife Insights. Following the iNaturalist model, Wildlife Insights also publishes its data to GBIF. This results in duplicate datasets on GBIF. The Wildlife Insights dataset will be significantly larger, but that doesn't negate the duplicate issue.
Scenario 2
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data using Wildlife Insights. Wildlife Insights doesn't publish data to GBIF. The researcher and Wildlife Insights would like to connect the dataset to other Wildlife Insights datasets on GBIF. Here, instead of Wildlife Insights publishing in bulk, a collection of datasets is connected, and together they represent the initiative.


Jegelewicz commented on September 21, 2024

> Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.


ymgan commented on September 21, 2024

> We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I am not sure if this is relevant, but we use the project ID in the EML, assuming each dataset has only one project, of course. GBIF can then group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO

For datasets with multiple projects, an issue is opened here: gbif/ipt#1780
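For anyone who wants to build that kind of link for their own project ID, here is a small sketch. It assumes the portal writes "/" in the project ID as "~2F", which is inferred only from the single link above; the function name and the decoded ID "BR/154/A1/RECTO" are likewise assumptions for illustration.

```python
PORTAL_SEARCH = "https://www.gbif.org/dataset/search"


def project_search_url(project_id):
    """Build a GBIF portal dataset-search link for one project ID.

    Assumption: "/" is written as "~2F" in the portal URL, as seen in the
    example link in this thread. Other characters are left untouched here.
    """
    encoded = project_id.replace("/", "~2F")
    return f"{PORTAL_SEARCH}?project_id={encoded}"


url = project_search_url("BR/154/A1/RECTO")
print(url)
```

This only covers the one-project-per-dataset case described above.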


debpaul commented on September 21, 2024

@ben-norton Scenario 2 is what I'm thinking about (although I'd guess in the situation I'm thinking about, your first scenario is very likely going to happen).

It makes me think hard from a Latimer Core perspective. We're really talking about whole/parts relationships and the many variables around which we might pivot or group data.

So, for a given distributed project, then
a) They would agree to only submit to GBIF once
b) There'd be a field (Grant Number?) around which their data could be grouped.
c) (Or maybe each group in this consortium has a group ID? that goes with the grant number?)

Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

d) Even in that case ^^^ The original project would still like to see / grab through the API / visualize their aggregated data.

Where does this leave us? Are there standards in place to help us do this? Or do we have a gap?


debpaul commented on September 21, 2024

> > Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?

> Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.

Gotcha @Jegelewicz although in this case I'm really talking about projects, not really institutions. We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number. Something like this would be useful. When you start to tease it apart, many specimens will be touched / imaged / sampled / etc in connection to different grants. So it's also a one:many thing. We need a way to group the objects around that grant number...


Jegelewicz commented on September 21, 2024

> We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

I could see this being covered by the Identifier class in LatimerCore.


debpaul commented on September 21, 2024

> > We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

> I could see this being covered by the Identifier class in LatimerCore.

Thanks! It's definitely parallel to the idea of pivoting different parts of the same collection in different ways.


debpaul commented on September 21, 2024

> > We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.

> I am not sure if this is relevant, but we use the project id in the eml, assuming that 1 dataset only has 1 project of course. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO

> For datasets with multiple projects, an issue is opened here: gbif/ipt#1780

@ymgan in the scenario I'm describing, various groups across the USA would be collecting data (observations and specimens) on their own, in their own areas of the USA. They'd be using a standard protocol. The goal would be for all these distributed sets to come together around a particular data point. Perhaps this Project ID in the EML could do that. Does this sound parallel to what you are describing?


ymgan commented on September 21, 2024

Yes, that is parallel to what I am describing. However, it works at the dataset level. At the record level, datasetName and datasetID do seem to be meant for this purpose:

gbif/pipelines#662

@dagendresen made a good remark here: gbif/pipelines#665 (comment)

One important reason or rationale is to group records produced or updated from different project funding. Similar to how the GBIF BID, BIFA, and CESP projects list datasets produced by this project funding. However, often we see project funding for georeferencing, or taxonomic validation and desire to "tag" the data records (or actually ultimately rather desire to "tag" the actual real-world collection specimens) that were georeferenced from a specific project funding --> to credit the funder and track fulfillment of the promise to the funder of e.g. georeferencing 10 000 collection specimens...
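As a rough illustration of record-level grouping, here is a minimal sketch that buckets occurrence records by their datasetID value. The records, the ID values, and the function name are made up, and it treats datasetID as a single value per record, so it does not address the one:many case (one specimen touched by several grants) raised earlier in this thread.

```python
from collections import defaultdict


def group_by_dataset_id(records):
    """Map each datasetID to the occurrenceIDs that carry it.

    Hypothetical helper: records without a datasetID go into a
    placeholder bucket so they are not silently dropped.
    """
    groups = defaultdict(list)
    for rec in records:
        dataset_id = rec.get("datasetID") or "(no datasetID)"
        groups[dataset_id].append(rec["occurrenceID"])
    return dict(groups)


# Made-up records; the datasetID values are placeholders, not real grant numbers.
records = [
    {"occurrenceID": "occ-1", "datasetID": "grant:EXAMPLE-001",
     "datasetName": "Example georeferencing project"},
    {"occurrenceID": "occ-2", "datasetID": "grant:EXAMPLE-001",
     "datasetName": "Example georeferencing project"},
    {"occurrenceID": "occ-3", "datasetID": "grant:EXAMPLE-002",
     "datasetName": "Example taxonomic validation project"},
    {"occurrenceID": "occ-4"},
]

grouped = group_by_dataset_id(records)
for dataset_id, occurrence_ids in grouped.items():
    print(dataset_id, occurrence_ids)
```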

