iobis / project-team-genetic-data

Developing guidelines for adding sequence data to OBIS

tdwg darwin-core genetic-data biodiversity-data

project-team-genetic-data's Introduction

Adding Genetic Data to OBIS

Introduction

This repository is the main discussion channel for the OBIS project team on genetic data in 2021. The objective of the project team is to discuss the guidelines required for adding genetic data to OBIS, as well as how OBIS will store, access, and analyse that data.

This project team will work in conjunction with the TDWG task team for Sustainable DarwinCore MIxS Interoperability (https://www.tdwg.org/community/gbwg/MIxS/) and will use the extension decided on by the community, providing feedback to the task team on issues discovered through the use cases. In addition, the guidelines in development by GBIF will be reviewed (https://docs.gbif-uat.org/publishing-dna-derived-data/1.0/en/), and OBIS will align its guidelines so that interoperability between GBIF and OBIS is retained, providing feedback if any issues are encountered.

Goals and Outcomes

The objective of the project is to have guidelines with use cases ready to submit to the 10th Session of the SG-OBIS (Nov 2021). Initially, the GBIF guidelines will be reviewed and the DwC-MIxS extension will be tested with different use cases. Most importantly, however, the discussion will focus on how OBIS will store genetic data, how this data will be analysed and updated, and how different issues will be dealt with.

Part 1

Review GBIF guidelines
Follow DwC-MIxS interoperability developments
Review MIxS data fields, and how well these are suited for OBIS

Part 2

Discussion and decisions on how sequence data will be dealt with in OBIS
- Will OBIS store sequences or references to other databases?
- Will OBIS analyse sequence data, i.e. have its own bioinformatics pipeline?
- How will counts be dealt with?
- How will unnamed or cryptic species be dealt with?
- Will taxonomies be updated inside OBIS?
- Will OBIS support data submission through the BIOM format?
- How will OBIS deal with control data?
- How will OBIS deal with simultaneous analysis of several biomarkers?

Questions and suggestions

The most up-to-date guidelines will be collected in the guidelines folder. Questions on specific issues encountered for datasets should be added to the issues tab.

Materials to help get started

OBIS organized a webinar on genetic data, with an introduction to how OBIS is incorporating data, how genetic data can be accessed and a use case from the first eDNA dataset provided by OBIS-USA. The recording of the webinar can be watched here.

In addition, as the first use case: the data and python scripts used for formatting the first eDNA dataset are available here!

We are also always looking for more use cases that could serve as examples for adding genetic data to OBIS.

project-team-genetic-data's Issues

Standardized bioinformatic pipeline

What is the best way to record the bioinformatic tools/pipelines that were used?

I understand there are some developments on this in Ocean Best Practices; we should look into that.

Through the PacMAN project, OBIS will also be developing a pipeline, or researching how output from existing pipelines can be formatted for DwC-A. Is there a need for this from other users?

required vs highly recommended fields in the guidelines for metabarcoding data

I work as a bioinformatician at the Hakai Institute and I'm helping to develop best practices for submission of our eDNA experiments to OBIS. I had a look at the metabarcoding guidelines in this repo and I have a suggestion.

Since our eDNA occurrence data is contextualized by the target gene, subfragment, and even forward and reverse primer pair in the same way that a trawl dataset would be contextualized by the holes in their net, shape of the net, depth of the trawl, etc., I expected the gene, subfragment, and f & r primer fields to be required for a submission, but they're only highly recommended.

As someone who might want to use OBIS as a source for occurrence data from sequencing experiments in the future, I'd like to suggest that these fields be required for submission of metabarcoding data. Without that information, it is not possible to know whether absences are due to the target gene/fragment/primers being used, or reflect the actual absence of a particular organism from the sampling environment.
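To illustrate what such a requirement could look like in practice, here is a minimal sketch of a completeness check on one occurrence record. The field names (`target_gene`, `target_subfragment`, `pcr_primer_forward`, `pcr_primer_reverse`) are assumed to follow MIxS-style terms, and the record values are placeholders rather than real data; neither is taken from the guidelines in this repo.

```python
# Assumed field names, modelled on MIxS-style terms; adjust them to the
# extension that is actually used for the submission.
CONTEXT_FIELDS = (
    "target_gene",
    "target_subfragment",
    "pcr_primer_forward",
    "pcr_primer_reverse",
)


def missing_context_fields(record):
    """Return the context fields that are absent or blank in one record."""
    return [
        field
        for field in CONTEXT_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Placeholder record, for illustration only.
record = {
    "occurrenceID": "example:occurrence:1",
    "target_gene": "16S rRNA",
    "target_subfragment": "V4",
    "pcr_primer_forward": "",
}

print(missing_context_fields(record))
```

A record with an empty list here carries enough context to interpret its absences; anything else could be rejected or flagged, depending on whether the fields end up required or only highly recommended.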

Incorporation of control data

eDNA data commonly includes a number of control samples: controls from collection, extraction, PCR, etc. The number of reads and the taxonomic identifications of any ASVs found in the control samples are a crucial part of the dataset, and how researchers choose to use the control data to interpret/filter the rest of the data varies greatly (there isn't an accepted standard practice in the community). Given this, it seems important to be able to include control data in submissions of genetically-derived data to OBIS.

It seems like these data cannot be part of the occurrence file, since any ASVs found do not indicate species presence or absence. Where could they be included as part of a DwC-formatted OBIS submission?

Alternatively, if control data cannot be incorporated, it might be worth discussing whether OBIS wants to adopt some sort of conservative standard practice, like removing the records of all ASVs detected in the controls. This would have to be understood by users and throughout the broader community, though.
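The conservative practice mentioned above could be sketched roughly as follows. This is an illustration only, not an agreed OBIS procedure: the data layout (a dictionary of read counts per sample), the function name, and the sample and ASV identifiers are all hypothetical.

```python
def drop_control_asvs(read_counts, control_samples):
    """Remove every ASV that has at least one read in any control sample.

    read_counts: {sample_id: {asv_id: number_of_reads}}
    control_samples: set of sample IDs that are controls
    Returns (filtered counts for non-control samples, set of removed ASVs).
    """
    # ASVs seen with one or more reads in any control.
    flagged = {
        asv
        for sample in control_samples
        for asv, reads in read_counts.get(sample, {}).items()
        if reads > 0
    }
    # Keep only field samples, without the flagged ASVs.
    filtered = {
        sample: {asv: n for asv, n in counts.items() if asv not in flagged}
        for sample, counts in read_counts.items()
        if sample not in control_samples
    }
    return filtered, flagged


# Made-up counts, for illustration only.
read_counts = {
    "site_A": {"ASV_1": 120, "ASV_2": 15, "ASV_3": 40},
    "site_B": {"ASV_1": 80, "ASV_3": 5},
    "pcr_blank": {"ASV_2": 3, "ASV_3": 0},
}

filtered, flagged = drop_control_asvs(read_counts, {"pcr_blank"})
print(sorted(flagged))
print(filtered)
```

Note that this rule discards `ASV_2` everywhere, including the 15 reads at `site_A`, which is exactly the kind of trade-off that users of the data would need to be aware of.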

Can OBIS help in developing genetic reference databases?

Especially with eDNA data, there is a need for reliable, (possibly local?) genetic reference databases that are used for the taxonomic assignment of sequences. Is there a role that OBIS as a database could have in developing more reliable genetic databases? Obviously we cannot (and do not want to) compete with existing genetic databases, but can we envision possibilities for curation for example?

NCBI taxonomy as a taxonomic authority

Could NCBI taxonomy IDs be useful as taxonomic links?
Where will these be added?

Differentiate between physical occurrences vs. DNA occurrences

Question from meeting:
▪ How can users of the data differentiate between occurrences based on physical samples vs. DNA?
▪ Dmitry Schigel of GBIF said that they have been thinking about this. Currently this would be recorded/sliced through the basisOfRecord field, but this is not ideal.
▪ Possibility to use flags on data for the different sources
▪ Concerns about how non-genetic scientists can use the data (Katrina Exter). Let's say someone wants to compare the species from eDNA from project X to those from project Y: how will they know that they are comparing like with like? If the types of sequences are different (e.g. ITS vs 16S), then you will by definition get different species sets out of the data, because they don't look at the same creatures. If project X uses library XX and project Y uses library YY, and they are known to have different levels of accuracy or coverage, then you are not comparing like with like. But how can someone know that if they do not have a background in DNA? Do we expect them to figure it out themselves? That is an option, but then it would be good to add a flag warning people to do this.

We need to make sure that when searching for occurrences, there is a clear separation between DNA-data vs. other occurrence data.
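As a rough sketch of the basisOfRecord approach discussed in the meeting, the snippet below groups records by that field. It assumes that DNA-derived occurrences are published with `basisOfRecord` set to `MaterialSample`, which is an assumption made for this example and not something decided in this thread; the records themselves are placeholders.

```python
from collections import defaultdict


def group_by_basis_of_record(records):
    """Group occurrence records by their basisOfRecord value."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get("basisOfRecord", "unknown")].append(record)
    return dict(groups)


# Placeholder records, for illustration only.
records = [
    {"occurrenceID": "example:1", "basisOfRecord": "MaterialSample"},
    {"occurrenceID": "example:2", "basisOfRecord": "HumanObservation"},
    {"occurrenceID": "example:3", "basisOfRecord": "MaterialSample"},
]

groups = group_by_basis_of_record(records)
# Assumption: DNA-derived records are the ones marked as MaterialSample.
dna_candidates = groups.get("MaterialSample", [])
print([r["occurrenceID"] for r in dna_candidates])
```

The weakness is visible in the variable name: `MaterialSample` is not exclusive to DNA-derived data, so this only yields candidates, which may be why the approach was described as not ideal and why a dedicated flag was suggested.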

Including protocols

How can protocols be included in DwC-A?

One option is to use an Ocean Best Practices or protocols.io URI.

How will ABS on DSI be ensured in OBIS?

This question concerns digital sequence information (DSI) and how the use of this data is defined/restricted by the Nagoya Protocol on access and benefit-sharing (ABS):

Clarifications from Katrina Exter: ABS-related permits are for
(1) the collection of the physical samples by one or more named persons
(2) the purpose of the collection.
Any digital data derived from the samples (in this case, DwC-A files on OBIS) should include, in their metadata, the ABS permits that were obtained (permit numbers and other info).

However, this information is not necessary for anyone who then comes along and re-uses that digital data. Only if they re-use the physical samples do they need to get a new permit. HOWEVER, the scope is changing and digital data may come to fall under the Nagoya Protocol. That, I guess, would mean that the original ABS permit information definitely has to be in the metadata, so that someone who wants to reuse the digital data can find out whether they need to apply for new ABS permits. Whose responsibility it is to ensure that people do not use the data without the correct ABS permits, I am not sure. I have heard both that it is the responsibility of the data provider (e.g. OBIS) and that it is NOT their responsibility. But for sure, providing all necessary information will be OBIS's responsibility, as they are the data publisher (no?)

We need to find out how this will exactly work.

Dealing with unnamed/cryptic species

In GBIF it is possible to add OTU/ASV IDs as an identifier, but this is not currently possible in OBIS. At the moment we recommend registering the highest possible scientific name. Are there other possibilities?