Comments (10)

Juke34 commented on August 28, 2024

We could change this order, but I suspect the change would be purely aesthetic: it would affect the order in the EMBL flat file, but I don't think it would change the ordering in the ENA archive.
We should contact the helpdesk to confirm how their submission pipeline handles it. So the question is: do they keep the order in which sequences appear in the flat file?
If there is a specific order to follow, you usually provide an AGP file that explains to ENA how to scaffold those contigs. (An AGP file is a sort of recipe that says how to concatenate the different contigs, in which order and orientation, to create a long scaffold.)

Anyway, we could check whether changing the EMBLmyGFF3 behavior is an easy task. If so, we will change it.

from emblmygff3.

ireneortega commented on August 28, 2024

Yes, it's just a question of aesthetics. I have no idea what order they keep in the ENA archive, but I received the .gz file after submitting the annotated sequences, and the order is the same as in the flat file. Honestly, I don't know whether that .gz file is the one that will be released to the public, as this is my first time submitting to ENA. I only uploaded the assembly (I'm only interested in the contig level, not scaffolds) and the sequence annotations. In fact, I didn't know about the AGP file. Should I upload this file too? Is this format correct for that file? I haven't seen this file before...

contig1	1	102565	1	W	OV1234	1	102565	+
contig2	1	88529	1	W	OV1235	1	88529	+
contig3	1	18341	1	W	OV1236	1	18341	+

Thanks for your help!

Juke34 commented on August 28, 2024

If order matters, then yes, you should upload an AGP file. See the dedicated section in the ENA help:
https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#agp-file
and NCBI https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/
and examples at the bottom as here: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/scaffold_from_contig_WGS.agp.v2.0/

Look at biostars.org; there are several questions there related to AGP files.

You can then run the validator to make sure the file is well formed: https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi
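
For a quick local look before using the validator, here is a minimal sketch in Python of the basic layout of an AGP v2.0 line as described in the NCBI specification linked above: nine tab-separated columns and, for sequence components, an object span that is as long as the component span. The helper name check_agp_line is hypothetical, and this checks only a small part of the format; the NCBI validator remains the authority.

```python
# Minimal sanity check for AGP v2.0 lines (a sketch, not a full validator).
# Column layout follows the NCBI AGP specification; check_agp_line is a
# hypothetical helper written for this illustration.

GAP_TYPES = {"N", "U"}  # component types that describe gaps, not sequences
ORIENTATIONS = {"+", "-", "?", "0", "na"}  # orientation values allowed by AGP v2.0


def check_agp_line(line):
    """Return a list of problems found in one tab-separated AGP line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(fields)]

    problems = []
    component_type = fields[4]
    try:
        object_beg, object_end = int(fields[1]), int(fields[2])
    except ValueError:
        return ["object_beg and object_end must be integers"]
    if object_beg > object_end:
        problems.append("object_beg is greater than object_end")

    # Columns 6-9 only describe a sequence when the line is not a gap line.
    if component_type not in GAP_TYPES:
        try:
            component_beg, component_end = int(fields[6]), int(fields[7])
        except ValueError:
            return problems + ["component_beg and component_end must be integers"]
        # The span on the object must be as long as the span of the component.
        if object_end - object_beg != component_end - component_beg:
            problems.append("object span and component span differ in length")
        if fields[8] not in ORIENTATIONS:
            problems.append("unknown orientation: %s" % fields[8])
    return problems


# First line of the AGP excerpt posted earlier in this thread.
example = "contig1\t1\t102565\t1\tW\tOV1234\t1\t102565\t+"
print(check_agp_line(example))  # prints []
```

An empty list only means that the line has the expected shape; it says nothing about whether the object and component names are the ones ENA expects.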

Juke34 commented on August 28, 2024

I got an answer, and order matters:

The accession numbers will be assigned to the sequences in the exact order it
was included in the flat file submission. 

So we need to update EMBLmyGFF3 to fix this problem.

Juke34 commented on August 28, 2024

@ireneortega I would need some feedback from you to close this issue. I have tried on my side, and EMBLmyGFF3 v2.1 seems to work as expected. Did you use an older version? Could you tell me your Python version, EMBLmyGFF3 version, and Biopython version? Otherwise, could you try with EMBLmyGFF3 v2.1 to see if you get the same problem?

ireneortega commented on August 28, 2024

@Juke34 My contig ordering in the EMBL file is not the same as in the FASTA and GFF files; the contigs are ordered in the way I described at the beginning. After assembly, contigs are named contig1, contig2, etc., and then the contigs are reordered, for example: contig14, contig1, contig3, etc. I want the EMBL file to show the contigs in this same order. What I got is: contig10, contig11...contig19, contig1, contig20...

I am using EMBLmyGFF3 v2.1 with Python 2.7.18 and biopython 1.76.

Juke34 commented on August 28, 2024

OK, then it would be fixed if you install Python >= 3.6.

Since Python 3.6, the default dict class maintains key order, meaning this dictionary will reflect the order of records given to it. As of Biopython 1.72, on older versions of Python we explicitly use an OrderedDict so that you can always assume the record order is preserved.

I tried with this order:

ERS324955|SC|contig000001
ERS324955|SC|contig000012
ERS324955|SC|contig000011
ERS324955|SC|contig000003

The order from the FASTA is kept by seq_dict = SeqIO.to_dict(SeqIO.parse(infasta, "fasta")) (still 1, 12, 11, 3), but then, when parsing the GFF with for record in GFF.parse(infile, base_dict=seq_dict): the order becomes

ERS324955|SC|contig000001
ERS324955|SC|contig000003
ERS324955|SC|contig000011
ERS324955|SC|contig000012

I also tried moving the GFF features around, and the final order is still respected. So updating Python would fix the problem.
In the next release I will require newer versions of Python and Biopython, and add a test to check that the record order is preserved.
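
Independently of the Python version, one way to make the output follow the FASTA is to reorder the parsed records explicitly. The sketch below uses only the standard library and the IDs from the test above; fasta_ids and reorder_like_fasta are hypothetical helpers, not functions of EMBLmyGFF3 or Biopython, and this is not necessarily how the fix is implemented in the tool.

```python
# Sketch: put record IDs back into the order of the FASTA file.
# fasta_ids and reorder_like_fasta are hypothetical helpers for illustration.


def fasta_ids(fasta_text):
    """Return sequence IDs in the order their headers appear in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.append(header[0])
    return ids


def reorder_like_fasta(record_ids, fasta_order):
    """Sort record IDs by their position in the FASTA; unknown IDs go last."""
    rank = {seq_id: i for i, seq_id in enumerate(fasta_order)}
    return sorted(record_ids, key=lambda seq_id: rank.get(seq_id, len(rank)))


# Headers from the test above; the ACGT sequences are placeholders.
fasta_text = """>ERS324955|SC|contig000001
ACGT
>ERS324955|SC|contig000012
ACGT
>ERS324955|SC|contig000011
ACGT
>ERS324955|SC|contig000003
ACGT
"""

# Order observed after parsing the GFF.
parsed = [
    "ERS324955|SC|contig000001",
    "ERS324955|SC|contig000003",
    "ERS324955|SC|contig000011",
    "ERS324955|SC|contig000012",
]

restored = reorder_like_fasta(parsed, fasta_ids(fasta_text))
print(restored)  # contig000001, contig000012, contig000011, contig000003
```

Because sorted is stable, records whose IDs are missing from the FASTA keep their relative order at the end of the list.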

ireneortega commented on August 28, 2024

I've just installed EMBLmyGFF3 through conda with python 3.6 and the same problem appeared (contig1, contig10, contig11...).
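
The order reported here (contig1, contig10, contig11...) is what a plain string sort of the contig names produces. That suggests, without proving it, that the IDs are sorted as text somewhere along the way. The sketch below uses made-up contig names to show the difference between a string sort and a numeric-aware sort; natural_key is a hypothetical helper. Neither result is the file order that this issue asks for.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so contig2 sorts before contig10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# Hypothetical contig names, listed in the order they might appear in a FASTA.
names = ["contig14", "contig1", "contig3", "contig10", "contig2"]

# Plain string sort: compares character by character, so "10" comes before "2".
print(sorted(names))
# ['contig1', 'contig10', 'contig14', 'contig2', 'contig3']

# Numeric-aware sort: compares the digit runs as integers.
print(sorted(names, key=natural_key))
# ['contig1', 'contig2', 'contig3', 'contig10', 'contig14']
```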

Juke34 commented on August 28, 2024

Try with branch 2.2.

  1. Clone the repository with git.
  2. Switch to the 2.2 branch (git checkout 2.2).
  3. There is a conda file you can use to prepare the environment: conda env create -f conda_environment_AGAT.yml, then conda activate emblmygff3.
  4. Install EMBLmyGFF3 with python setup.py install.

Then it should work properly.

Juke34 commented on August 28, 2024

Please feel free to reopen the issue if you still encounter problems in v2.2 of EMBLmyGFF3.
