Comments (10)
We might change this order, but I guess the change would be purely aesthetic. It would affect the order in the EMBL flat file, but I don't think it would change anything about the ordering in the ENA archive.
We should contact the helpdesk to confirm how their submission pipeline handles it. The question is: do they keep the order of the sequences as found in the flat file?
If there is a specific order to follow, you usually provide an AGP file that explains to ENA how to scaffold those contigs. (An AGP file is a sort of recipe that says how to concatenate the different contigs, in which order and direction, to create a long scaffold.)
Anyway, we could check whether changing the EMBLmyGFF3 behavior is an easy task. If so, we will change it.
from emblmygff3.
Yes, it's just a question of aesthetics. I have no idea what order they keep in the ENA archive, but I received the .gz file after submitting the annotated sequences, and the order is the same as in the flat file. Honestly, I don't know if that .gz file is the one that will be released to the public, as this is my first time submitting to ENA. I only uploaded the assembly (I am only interested in the contig level, not scaffolds) and the sequence annotations. In fact, I didn't know about the AGP file. Should I upload this file too? Is the following format correct for that file? I haven't seen this file before...
contig1 1 102565 1 W OV1234 1 102565 +
contig2 1 88529 1 W OV1235 1 88529 +
contig3 1 18341 1 W OV1236 1 18341 +
Thanks for your help!
If order matters, yes, you should upload an AGP file. See the dedicated section in the ENA help:
https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#agp-file
and the NCBI specification: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/
and the examples at the bottom of that page, such as this one: https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/scaffold_from_contig_WGS.agp.v2.0/
Look at biostars.org; there have been several questions related to AGP files.
You can then use the validator to check that the file is well formed: https://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi
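Before sending a file to the validator, a quick local check can catch obvious slips. The sketch below is only an illustration, not a replacement for the NCBI validator: it covers component lines only, the function name check_agp_component_line is hypothetical, and it splits on any whitespace because the lines pasted in this thread are space-separated (real AGP files are tab-delimited).

```python
# Orientation values allowed for component lines in AGP 2.0.
VALID_ORIENTATIONS = {"+", "-", "?", "0", "na"}


def check_agp_component_line(line):
    """Return a list of problems found in one AGP component line (empty if none).

    Hypothetical helper for illustration; gap lines (type N or U) are not handled.
    """
    # Assumption: whitespace-separated fields; real AGP files use tabs.
    fields = line.split()
    if len(fields) != 9:
        return ["expected 9 columns, found %d" % len(fields)]
    obj, obj_beg, obj_end, part, comp_type, comp_id, comp_beg, comp_end, orient = fields
    if comp_type in ("N", "U"):
        return ["gap lines are not handled by this sketch"]
    try:
        obj_span = int(obj_end) - int(obj_beg)
        comp_span = int(comp_end) - int(comp_beg)
    except ValueError:
        return ["coordinates must be integers"]
    problems = []
    if obj_span < 0 or comp_span < 0:
        problems.append("end coordinate is smaller than start coordinate")
    if obj_span != comp_span:
        problems.append("object and component spans differ")
    if orient not in VALID_ORIENTATIONS:
        problems.append("unknown orientation: %s" % orient)
    return problems


# The three lines posted earlier in this thread.
lines = [
    "contig1 1 102565 1 W OV1234 1 102565 +",
    "contig2 1 88529 1 W OV1235 1 88529 +",
    "contig3 1 18341 1 W OV1236 1 18341 +",
]
for line in lines:
    print(line, "->", check_agp_component_line(line) or "passes these basic checks")
```

Passing these checks only means the column count, coordinates, and orientation look consistent; the official validator remains the reference.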
I got an answer. Order matters:
The accession numbers will be assigned to the sequences in the exact order it was included in the flat file submission.
So we need to update EMBLmyGFF3 to fix this problem.
@ireneortega I need some feedback from you to close this issue. I have tried on my side, and EMBLmyGFF3 v2.1 seems to work as expected. Did you use an older version? Could you tell me your Python, EMBLmyGFF3, and Biopython versions? Otherwise, could you try EMBLmyGFF3 v2.1 and check whether you see the same problem?
@Juke34 The contig order in the EMBL file does not match the order in the FASTA and GFF files; the contigs are ordered the way I described at the beginning. After assembly, the contigs are named contig1, contig2, etc., and are then reordered, for example: contig14, contig1, contig3, etc. I want the EMBL file to list the contigs in that same order. What I get instead is: contig10, contig11...contig19, contig1, contig20...
I am using EMBLmyGFF3 v2.1 with Python 2.7.18 and Biopython 1.76.
OK, then it would be fixed if you install Python >= 3.6.
Since Python 3.6, the default dict class maintains key order, meaning this dictionary will reflect the order of records given to it. As of Biopython 1.72, on older versions of Python we explicitly use an OrderedDict so that you can always assume the record order is preserved.
I tried with this order:
ERS324955|SC|contig000001
ERS324955|SC|contig000012
ERS324955|SC|contig000011
ERS324955|SC|contig000003
The order from the FASTA file is kept by
seq_dict = SeqIO.to_dict(SeqIO.parse(infasta, "fasta"))
(still 1, 12, 11, 3), but when the GFF is then parsed with
for record in GFF.parse(infile, base_dict=seq_dict):
the order becomes
ERS324955|SC|contig000001
ERS324955|SC|contig000003
ERS324955|SC|contig000011
ERS324955|SC|contig000012
I also tried moving the GFF features around: the final order is still respected. So updating Python would fix the problem.
In the next release I will require newer versions of Python and Biopython, and add a test to check that the order is preserved.
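Independently of the Python version, the record order can also be restored explicitly after parsing. The following is a minimal sketch of that idea using only the standard library; it is not how EMBLmyGFF3 does it, and the helper names fasta_ids_in_order and reorder_like_fasta, as well as the (record_id, payload) tuples standing in for parsed records, are hypothetical.

```python
def fasta_ids_in_order(fasta_lines):
    """Return sequence IDs in the order their '>' header lines appear."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:  # skip empty headers
                ids.append(parts[0])
    return ids


def reorder_like_fasta(records, fasta_ids):
    """Sort (record_id, payload) pairs so they follow the FASTA order.

    IDs missing from the FASTA go last; sorted() is stable, so they keep
    their original relative order.
    """
    rank = {seq_id: i for i, seq_id in enumerate(fasta_ids)}
    return sorted(records, key=lambda rec: rank.get(rec[0], len(rank)))


# FASTA order used in the test described above: 1, 12, 11, 3.
fasta = [
    ">ERS324955|SC|contig000001", "ACGT",
    ">ERS324955|SC|contig000012", "ACGT",
    ">ERS324955|SC|contig000011", "ACGT",
    ">ERS324955|SC|contig000003", "ACGT",
]
# Records as they came out of the GFF parsing step: 1, 3, 11, 12.
parsed = [
    ("ERS324955|SC|contig000001", "features..."),
    ("ERS324955|SC|contig000003", "features..."),
    ("ERS324955|SC|contig000011", "features..."),
    ("ERS324955|SC|contig000012", "features..."),
]
restored = reorder_like_fasta(parsed, fasta_ids_in_order(fasta))
for record_id, _ in restored:
    print(record_id)
```

Sorting by an explicit rank makes the output order depend only on the FASTA file, not on how any intermediate dictionary happens to order its keys.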
I've just installed EMBLmyGFF3 through conda with python 3.6 and the same problem appeared (contig1, contig10, contig11...).
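The order contig1, contig10, contig11... is what a plain string sort of the contig names produces. The sketch below only illustrates that effect and a numeric-aware alternative; whether a string sort is what actually happens inside the tool is an assumption, and natural_key is a hypothetical helper. Note that neither ordering is the FASTA order requested in this issue.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that numbers compare numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


names = ["contig2", "contig20", "contig1", "contig11", "contig10"]

# Plain string comparison: "contig10" sorts before "contig2".
print(sorted(names))

# Numeric-aware comparison: contig2 comes before contig10.
print(sorted(names, key=natural_key))
```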
Try with branch 2.2.
- Git clone the repo
- Move into the 2.2 branch (git checkout 2.2)
- There is a conda file you can use to prepare the env: conda env create -f conda_environment_AGAT.yml, then conda activate emblmygff3
- Install EMBLmyGFF3 with python setup.py install
Then it should work properly.
Please feel free to reopen the issue if you still encounter problems in v2.2 of EMBLmyGFF3.