Comments (10)
Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?
See https://data-blog.gbif.org/post/clustering-occurrences/, which describes what we're already doing and references Nicky's work.
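To make the "find the dupes from the data" idea concrete, here is a minimal sketch. It is not GBIF's clustering algorithm and not Nicky's implementation; it only buckets records on a coarse key built from a few Darwin Core fields and flags buckets that span more than one dataset. The sample records, the `datasetKey` values, and the rounding precision are all assumptions made for illustration.

```python
from collections import defaultdict


def candidate_key(record, precision=3):
    """Build a coarse comparison key from a few Darwin Core fields."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    return (
        (record.get("scientificName") or "").strip().lower(),
        record.get("eventDate"),
        round(lat, precision) if lat is not None else None,
        round(lon, precision) if lon is not None else None,
    )


def find_duplicate_candidates(records):
    """Return sorted occurrenceID groups whose key is shared across datasets."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[candidate_key(rec)].append(rec)
    groups = []
    for recs in buckets.values():
        # Only flag a bucket when more than one dataset contributes to it.
        if len({r["datasetKey"] for r in recs}) > 1:
            groups.append(sorted(r["occurrenceID"] for r in recs))
    return groups


# Hypothetical records: the same detection published by two parties.
records = [
    {"occurrenceID": "wi:1", "datasetKey": "wildlife-insights",
     "scientificName": "Lynx rufus", "eventDate": "2021-06-01",
     "decimalLatitude": 35.1234, "decimalLongitude": -80.5678},
    {"occurrenceID": "lab:9", "datasetKey": "researcher-lab",
     "scientificName": "lynx rufus ", "eventDate": "2021-06-01",
     "decimalLatitude": 35.12341, "decimalLongitude": -80.56779},
    {"occurrenceID": "lab:10", "datasetKey": "researcher-lab",
     "scientificName": "Procyon lotor", "eventDate": "2021-06-02",
     "decimalLatitude": 35.2, "decimalLongitude": -80.6},
]

duplicate_groups = find_duplicate_candidates(records)
```

A real approach would need fuzzier matching (collector names, identifiers, date ranges), but the sketch shows why publishers would not have to coordinate up front for duplicates to be detectable afterwards.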
from dwc-qa.
This is a significant issue in camera trapping. Most of the major projects (e.g., eMammal, Snapshot USA, Wildlife Insights) are collaborations between multiple institutions. These are referred to as 'initiatives' since they are larger than one 'organization' or 'institution'. Within those initiatives, providers may submit their own datasets as part of the effort.
Scenario 1
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data in Wildlife Insights. Following the iNaturalist model, Wildlife Insights publishes their data to GBIF. This results in duplicate datasets on GBIF. The Wildlife Insights data will be significantly larger, but that doesn't negate the duplicate issue.
Scenario 2
Let's say a researcher publishes their camera trap dataset to GBIF. They manage their data using Wildlife Insights. Wildlife Insights doesn't publish data to GBIF. The researcher and Wildlife Insights would like to connect the dataset to other Wildlife Insights datasets on GBIF. Here, instead of Wildlife Insights publishing in bulk, a collection of datasets is connected, and together those datasets represent the initiative.
Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?
Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.
We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.
I am not sure if this is relevant, but we use the project id in the eml, assuming that 1 dataset only has 1 project of course. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO
For datasets with multiple projects, an issue is opened here: gbif/ipt#1780
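For anyone who wants to do the same grouping programmatically, here is a small sketch that turns the project id from the portal link above into a dataset search request. Two things are assumptions on my part: that `~2F` in the portal URL stands for `/`, and that the GBIF registry API at `https://api.gbif.org/v1/dataset/search` accepts a `projectId` parameter. The code only builds the URL and does not call the service.

```python
from urllib.parse import urlencode

# Assumed endpoint for the GBIF registry dataset search.
API_ROOT = "https://api.gbif.org/v1/dataset/search"


def decode_portal_project_id(value):
    """Undo the '~2F' escaping seen in the portal URL (assumed to mean '/')."""
    return value.replace("~2F", "/")


def dataset_search_url(project_id, limit=20):
    """Build a search URL filtered on project id ('projectId' is assumed)."""
    return API_ROOT + "?" + urlencode({"projectId": project_id, "limit": limit})


project_id = decode_portal_project_id("BR~2F154~2FA1~2FRECTO")
url = dataset_search_url(project_id)
print(url)
```

As noted, this only works while each dataset carries exactly one project id in its EML.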
@ben-norton Scenario 2 is what I'm thinking about (although I'd guess in the situation I'm thinking about, your first scenario is very likely going to happen).
It makes me think hard from a Latimer Core perspective. We're really talking about whole/parts relationships and the many variables around which we might pivot or group data.
So, for a given distributed project, then
a) They would agree to only submit to GBIF once
b) There'd be a field (Grant Number?) around which their data could be grouped.
c) (Or maybe each group in this consortium has a group ID? that goes with the grant number?)
Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?
d) Even in that case ^^^ The original project would still like to see / grab through the API / visualize their aggregated data.
Where does this leave us? Are there standards in place to help us do this? Or do we have a gap?
Or, is it possible, that we (and GBIF) give up worrying about duplicates and instead use the data / metadata to find the dupes as Nicky's algorithms are doing?
Yes, please. There is no real way to do this. Institutions don't know and should not control what other institutions do with their data. Institutions also share pieces of a single thing and one should not be prevented from publishing just because the other one is.
Gotcha @Jegelewicz although in this case I'm really talking about projects, not really institutions. We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number. Something like this would be useful. When you start to tease it apart, many specimens will be touched / imaged / sampled / etc in connection to different grants. So it's also a one:many thing. We need a way to group the objects around that grant number...
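As a rough illustration of the one:many point, the sketch below builds an index from grant number to the specimens it touched. The `grants` field, the grant numbers, and the catalog numbers are hypothetical; no existing standard term is implied.

```python
from collections import defaultdict


def index_by_grant(specimens):
    """Map each grant number to the sorted catalog numbers it touched."""
    index = defaultdict(set)
    for specimen in specimens:
        # A specimen may be linked to zero, one, or many grants.
        for grant in specimen.get("grants", []):
            index[grant].add(specimen["catalogNumber"])
    return {grant: sorted(ids) for grant, ids in index.items()}


# Hypothetical specimens: one was imaged under one grant and sampled under another.
specimens = [
    {"catalogNumber": "S-001", "grants": ["GRANT-A", "GRANT-B"]},
    {"catalogNumber": "S-002", "grants": ["GRANT-A"]},
    {"catalogNumber": "S-003", "grants": []},
]

by_grant = index_by_grant(specimens)
```

The point is only that the link has to live on the object (or record), not on the dataset, if the same specimen can be credited to several grants.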
We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.
I could see this being covered by the Identifier class in LatimerCore.
We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.
I could see this being covered by the Identifier class in LatimerCore.
Thanks! It's definitely parallel to the idea of pivoting different parts of the same collection in different ways.
We've needed, for a long time, for example, a way to say a specific dataset was mobilized via a specific grant number.
I am not sure if this is relevant, but we use the project id in the eml, assuming that 1 dataset only has 1 project of course. GBIF can group these datasets together like this: https://www.gbif.org/dataset/search?project_id=BR~2F154~2FA1~2FRECTO
For datasets with multiple projects, an issue is opened here: gbif/ipt#1780
@ymgan, in the scenario I'm describing, various groups across the USA would be collecting data (observations and specimens) independently in their own regions, all using a standard protocol. The goal would be for all these distributed sets to come together around a particular data point. Perhaps the Project ID in the EML could do that. Does this sound parallel to what you are describing?
Yes, that is parallel to what I am describing, although it works at the dataset level. At the record level, datasetName and datasetID do seem to serve this purpose:
@dagendresen made a good remark here: gbif/pipelines#665 (comment)
One important reason or rationale is to group records produced or updated from different project funding. Similar to how the GBIF BID, BIFA, and CESP projects list datasets produced by this project funding. However, often we see project funding for georeferencing, or taxonomic validation and desire to "tag" the data records (or actually ultimately rather desire to "tag" the actual real-world collection specimens) that were georeferenced from a specific project funding --> to credit the funder and track fulfillment of the promise to the funder of e.g. georeferencing 10 000 collection specimens...
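A minimal sketch of that record-level tagging, assuming each record carries the funding project in `datasetID`: count the tagged records per project and compare against the number promised to the funder. The project identifier, the targets, and the sample records are hypothetical.

```python
from collections import Counter


def progress_by_dataset_id(records, targets):
    """Return {project: (records_tagged, target)} for each funded project."""
    counts = Counter(r["datasetID"] for r in records if r.get("datasetID"))
    return {pid: (counts.get(pid, 0), target) for pid, target in targets.items()}


# Hypothetical occurrence records; one has no funding tag at all.
records = [
    {"occurrenceID": "o1", "datasetID": "georef-2023"},
    {"occurrenceID": "o2", "datasetID": "georef-2023"},
    {"occurrenceID": "o3", "datasetID": None},
]

# Hypothetical promise to the funder: 10 000 georeferenced specimens.
targets = {"georef-2023": 10000}

progress = progress_by_dataset_id(records, targets)
```

This assumes one funding tag per record; records touched by several funded projects would need a multi-valued field or a separate link table.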