nbisweden / emblmygff3

An efficient way to convert gff3 annotation files into EMBL format ready to submit.

License: GNU General Public License v3.0

Python 99.09% Shell 0.91%

embl gff3 converter submission ena

emblmygff3's Issues

Spread features like CDS are not collectively linked when the L1 and/or L2 feature is missing

If the L1 (e.g. gene) and L2 (e.g. mRNA) features are missing for several CDS that must be collectively linked (one CDS with several positions in the EMBL file), the tool creates one EMBL CDS feature per GFF CDS feature.
It would be nice to handle this case.
Currently, to deal with it, we need to run agat_sp_gxf_to_gff.pl from AGAT to create the missing L1 and L2 features.
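To illustrate the linking described above, here is a minimal sketch (not the tool's implementation) that groups CDS rows sharing the same `Parent`, or the same `ID` when no `Parent` exists, into one EMBL-style location. The function names are hypothetical, and it assumes tab-separated GFF3 lines with nine columns.

```python
from collections import OrderedDict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def group_cds(gff_lines):
    """Group CDS rows by (seqid, strand, Parent or ID), keeping input order."""
    groups = OrderedDict()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Fall back to the shared ID when the L2 parent is missing.
        key = (cols[0], cols[6], attrs.get("Parent") or attrs.get("ID"))
        groups.setdefault(key, []).append((int(cols[3]), int(cols[4])))
    return groups


def embl_location(parts, strand):
    """Build an EMBL location string from (start, end) pairs."""
    spans = ",".join("%d..%d" % part for part in sorted(parts))
    location = "join(%s)" % spans if len(parts) > 1 else spans
    return "complement(%s)" % location if strand == "-" else location
```

With two CDS rows sharing `ID=cds1` and no parent, `group_cds` returns a single group, and `embl_location` turns it into one `join(...)` rather than two separate CDS features.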

Issue with option --organelle

Traceback (most recent call last):
  File "/opt/6.x/python-2.7.2/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/opt/6.x/python-2.7.2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/__main__.py", line 4, in <module>
    main()
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/EMBLmyGFF3.py", line 1277, in main
    writer.set_organelle( args.organelle )
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/EMBLmyGFF3.py", line 964, in set_organelle
    organelle = self._verify( self.organelle, "organelle")
AttributeError: 'EMBL' object has no attribute 'organelle'

Locus tag clarification

Hi,

Thanks for developing this tool. It would be very useful for me.

Regarding the requirement to write the locus tag in the command (--locus_tag MY_LOCUS_TAG), do you mean the locus tag prefix, as described here? So, for Caenorhabditis elegans, this would be CELE because all gene features start with CELE?

Confusion with the locus tag

Hi,

There are some rules for EBI about the locus tag:

https://ena-docs.readthedocs.io/en/latest/faq/locus_tags.html

I used your script to create an EMBL flat file, but for each locus, _LOCUSXX (with XX a number) is added after my prefix, for example: PRE_LOCUSXX

In their example, a locus tag would be /locus_tag="BN5_00001". Will PRE_LOCUS therefore be considered a prefix by EBI and then be refused because of this rule?

All characters must be alphanumeric with none such as -_*

Error in translations

Issue reported by a user:

When I used the option --translate, some CDSs were translated incorrectly in the EMBL file. For instance:

In the gff3
Chr_1 AUGUSTUS        gene    55249   56486   0.84    -       .       ID=g13
Chr_1 AUGUSTUS        transcript      55249   56486   0.84    -       .       ID=g13.t1;Parent=g13
Chr_1 AUGUSTUS        stop_codon      55249   55251   .       -       0       Parent=g13.t1
Chr_1 AUGUSTUS        intron  55679   55753   0.95    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  55904   55957   0.94    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56015   56069   1       -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56228   56296   1       -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56394   56472   0.99    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55249   55678   0.91    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55754   55903   0.94    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55958   56014   0.94    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56070   56227   1       -       0       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56297   56393   0.99    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56473   56486   1       -       0       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        start_codon     56484   56486   .       -       0       Parent=g13.t1
# protein sequence = [MSLIRDSGPRRLVDGFWEYGRYYGSWRPRKYLFPIDAEELNRMDIFHKFFLVARDEALFASPLDPNRDQPLRILDLGT
#GTGIWAINVAEVTAVPPEIMVVDLHQIQPALIPLGISPLQFDIEEASWEPLMKDCDLVHIRMLYGSIQTDLWPDIYHKTFEHLKPGSGYIEHIEIDWV
#PRWDGNDVPPESSLHEWSQLLLRGLDRFNRNARIDVGEVRITLDKAGFVDFREETIRCYVNPWSSERREREIARWFNLGLSQCLEAMSLMPMIEGLSM
# TKEQVKELCDRAKKEICILRYHAYMTL]


In the converted embl
FT   CDS             complement(join(55249..55678,55754..55903,55958..56014,
FT                   56070..56227,56297..56393,56473..56486))
FT                   /locus_tag="LOCUSTAG_LOCUS13"
FT                   /codon_start=1
FT                   /note="source:AUGUSTUS"
FT                   /note="ID:g13.t1.cds"
FT                   /translation="CP*LEIQGQGVL*MDFGSMAGIMAHGDRGSICSRLTRRNLTGWTS
FT                   FTSSS*LLETKLYLPPHWTRTGTNPFEYLILELVPEYGPLMLQK*LLFHRRSWLWISIR
FT                   FSQPSFPSVFLPYNLTSKKHHGSL**KIATWCTYECSMAVSRPICGQIYTIKLSNI*SL
FT                   GLDT*NTLKSIGCPGGMETTSRPSHRCMNGPSYYCEAWIVSTGMPELMWGKFE*PSTRP
FT                   GSSISEKRPFGAT*THGPRSVVSGKLRDGSTSGFLNVSRR*V*CP**RG*V*PKNKSRS
FT                   SVTGPKRRFAYCAITLI*RC"
FT                   /transl_table=1

Any suggestions?

Thanks
Edison

Require help in mapping GFF3 type (Column3) to EMBL qualifier

Hello! I have many ncRNA features (lncRNA, snoRNA, snRNA, etc.) in my GFF3 file. Following your instructions, I included the following in the translation_gff_feature_to_embl_feature.json file:

"ncRNA_gene": {
"target": "ncRNA",
},
"snoRNA": {
"target": "ncRNA",
},
"lnc_RNA": {
"target": "ncRNA",
},

I am getting a new warning saying,

WARNING feature: The qualifier >ncRNA_class< is mandatory for the feature >ncRNA<. We will not report the feature.

I'm not quite sure how (and where) to add the ncRNA_class qualifier, since its value changes with the feature: if the feature is a lnc_RNA, it maps to ncRNA in the EMBL file but the ncRNA_class should be lncRNA; for snoRNA the ncRNA_class should be snoRNA, and so on.
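One possible workaround, sketched below under stated assumptions, is to pre-process the GFF3 so that each ncRNA-type row carries its own `ncRNA_class` attribute. This assumes the converter copies a GFF3 attribute named `ncRNA_class` into the EMBL qualifier of the same name, which I have not verified. The type-to-class table and the function name are illustrative only.

```python
# Hypothetical mapping from GFF3 type (column 3) to an ncRNA_class value.
NCRNA_CLASS_BY_TYPE = {
    "lnc_RNA": "lncRNA",
    "snoRNA": "snoRNA",
    "snRNA": "snRNA",
}


def add_ncrna_class(gff_line):
    """Append ncRNA_class=<value> to column 9 for the types listed above."""
    line = gff_line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 9:
        return line  # comments, directives, malformed rows: leave untouched
    ncrna_class = NCRNA_CLASS_BY_TYPE.get(cols[2])
    if ncrna_class is None or "ncRNA_class=" in cols[8]:
        return line
    if cols[8] in ("", "."):
        cols[8] = "ncRNA_class=" + ncrna_class
    else:
        cols[8] = cols[8].rstrip(";") + ";ncRNA_class=" + ncrna_class
    return "\t".join(cols)
```

Rows of any other type, and rows that already carry an `ncRNA_class` attribute, pass through unchanged.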

Converts to EMBL but with question marks in SQ

I'm working with yeasts and I really need their EMBL files, but when I run the program (using bash and Python), I run into the following problem.
The following messages appear on the terminal (though I don't think they are the cause of the problem):

17:25:17 ERROR feature: >>trna<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
17:25:17 WARNING feature: Unknown qualifier 'NAME' - skipped
17:25:17 ERROR feature: >>trna_exon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
17:25:30 ERROR feature: >>UTR<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
Conversion done

As the final line says, the conversion is done. When I open the generated EMBL file the features are fine, but the sequence is all question marks.

FT /transl_table=12
XX
SQ Sequence 2596028 BP; 0 A; 0 C; 0 G; 0 T; 2596028 other;
?????????? ?????????? ?????????? ?????????? ?????????? ?????????? 60
?????????? ?????????? ?????????? ?????????? ?????????? ?????????? 120

And if I keep scrolling, it is as if the conversion had started again:

 ????????                                                            2596028

//
ID XXX; XXX; linear; genomic DNA; XXX; XXX; 2596667 BP.
XX
AC XXX;
XX
AC * SOME_YEAST
XX
PR Project:XXX;

After that the only existing feature is "gap" and the sequence (SQ) is now as it is supposed to be:
FT gap 2556681..2556981
FT /estimated_length=301
XX
SQ Sequence 2596667 BP; 806943 A; 475017 C; 477105 G; 804281 T; 33321 other;
AATCTGCTCA GTAAGGCCCA TAAATCGGCT CTGCATTTCT TCTGTGGGCA TTTTGCCGTA 60
CTTTTTTAAT TATGTTGCAG ACGAAACTGA ATCAAGCTCG TCGACAGCTT CGTACAGCCT 120

I have no idea why this would happen. I really hope you can help me figure out what is happening;
I really need those EMBL files.

Implement a progress bar

It would be nice to implement a progress bar, perhaps using tqdm?

thank you

I don't have an issue, I am just really grateful you wrote this!

I was so frustrated trying to generate a gap file and it was so much easier to convert it directly with the annotation to EMBL format using your script! So, thank you!

Convert embl to gbk

Hi Jacques
Could you please add some functions to convert embl to gbk?

Best
Edison

Programme failed: ... KeyboardInterrupt Terminated

Hi,

I've run EMBLmyGFF3 on a cluster in the following way:

EMBLmyGFF3 c_elegans.PRJNA13758.WS263.annotations.gff3 c_elegans.PRJNA13758.WS263.genomic.fa --topology linear --molecule_type 'genomic DNA' --transl_table 1 --species 'Caenorhabditis elegans' --locus_tag CELE --project_id PRJNA13758 -o c_elegans.PRJNA13758.WS263.annotations.embl

I have checked that I have Python 2.6, Biopython 1.67 and bcbio-gff 0.6.4. I can successfully pull out the help from EMBLmyGFF3.

I got the following output from my command above:

Traceback (most recent call last):
  File "/nfs/users/nfs_c/user/anaconda3/envs/python2env/bin/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.3', 'console_scripts', 'EMBLmyGFF3')()
  File "/nfs/users/nfs_c/user/anaconda3/envs/python2env/lib/python2.7/site-packages/EMBLmyGFF3-1.2.3-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1275, in main
    for record in GFF.parse(infile, base_dict=seq_dict):
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 742, in parse
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 322, in parse_in_parts
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 343, in parse_simple
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 637, in _gff_process
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 667, in _lines_to_out_info
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 189, in _gff_line_map
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 89, in _split_keyvals
KeyboardInterrupt
Terminated

I submitted the command (in a .sh file) as a batch job with 20 MB memory on a GNU/Linux machine.

Update for Python 3 compliance

It would be nice to make the tool Python 3 compliant.

Feature Request: locus_tag parameter

Hi,

thank you for this great tool! It is of great help!

I would like to ask if it would be possible to introduce a new parameter with which the tool uses an already existing attribute as locus_tag, e.g. the ID or just locus_tag itself.
Thank You
Best Regards
Nadine

Add warning for unknown sequence

When a sequence from the GFF is not found in the provided FASTA file, the tool creates a string of ???? as the sequence. Its length corresponds to the end position of the last feature on the missing sequence.

It would be nice to inform the user that the sequence names are potentially out of sync between the GFF and the FASTA file.

On top of that, using the --translate option will raise an error because the ??? codon does not exist. A warning displayed before that error would help users understand the problem.
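Until such a warning exists, the mismatch can be detected before running the conversion. The sketch below is a standalone check, not part of EMBLmyGFF3. It assumes FASTA identifiers are the first whitespace-delimited token after `>` and compares them with column 1 of the GFF. The function names are made up for this example.

```python
def fasta_ids(fasta_lines):
    """Collect sequence identifiers from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(gff_lines):
    """Collect column-1 sequence names, stopping at an embedded ##FASTA section."""
    ids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_sequences(gff_lines, fasta_lines):
    """Sequence names used in the GFF that have no FASTA record."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))
```

Any name returned by `missing_sequences` is a sequence for which the output would contain only `?` characters.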

Another Question: Unused Stop Codon in mitochondrial DNA provokes Error

Hi,

I've stumbled upon another thing. In the mitochondrion of our organism not all stop codons are used as such. More precisely, TGA codes for tryptophan and is not a stop codon. So each time a gene has a TGA inside, I get the error "Stop codon found within the CDS...". Would it be possible to exclude certain stop (or even start) codons?

Thanks in advance

Best Regards,
Nadine
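As a small illustration of why the choice of translation table matters here: in the standard code (table 1) TGA is a stop codon, while in NCBI table 4 it encodes tryptophan. The sketch below is independent of the tool and only covers these two tables. The function name is hypothetical. Whether table 4 is the right one for your organism is something to confirm yourself.

```python
# Stop codons for two NCBI genetic codes (DNA alphabet).
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},  # standard code
    4: {"TAA", "TAG"},         # table 4: TGA encodes tryptophan
}


def internal_stops(cds, table=1):
    """Return 0-based codon indexes of in-frame stops before the last codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [
        index
        for index, codon in enumerate(codons[:-1])
        if codon in STOP_CODONS[table]
    ]
```

For the sequence `ATGTGAGGGTAA`, the second codon is reported as an internal stop with table 1 and not reported with table 4.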

When a CDS has no strand, the negative strand is used by default

1) When a CDS has no strand, the negative strand is used by default. It would probably be more coherent to use the plus strand by default.
2) We must raise an error when a CDS does not have any strand information.
This is awkward, but tools like Ugene can create CDS without strand.

Is the option --locus_numbering_start working?

Hi,

It seems to me that the --locus_numbering_start parameter is not working.
I provide an integer for this parameter (e.g. --locus_numbering_start 30) but it is not taken into consideration.
Could this be related to a 10 step increment in my gff3 input file coming from Prokka?
Thanks in advance, best

Warning regarding protein ID

Hello again. I'm getting the following warning:

WARNING feature: The value(s) ['AAEL012102-PB'] is(are) invalid for the qualifier protein_id of the feature CDS. We will not report the qualifier. (Here is the regex expected: [a-zA-Z]{3}[0-9]{5}\.[0-9]+)

I guess the hyphen in the name is causing an issue? All the protein IDs in my GFF3 file have a hyphen and end up triggering this error (after a point the program just gets tired of them and quits printing them). I would like to preserve this information in my EMBL file, can you suggest a fix?
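The regex quoted in the warning can be tried directly. The sketch below assumes the whole value must match the pattern, which is an assumption about how the tool applies it. It shows that `AAEL012102-PB` fails for more than the hyphen: the pattern wants exactly three letters, five digits, a dot and a version number.

```python
import re

# Pattern copied from the warning message above.
PROTEIN_ID_PATTERN = re.compile(r"[a-zA-Z]{3}[0-9]{5}\.[0-9]+")


def is_valid_protein_id(value):
    """True when the whole value looks like 'ABC12345.1'."""
    return PROTEIN_ID_PATTERN.fullmatch(value) is not None
```

Since the value does not have this shape, keeping it in another attribute that ends up as a /note may be the simpler route. That is a suggestion, not a confirmed feature of the tool.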

question: what additional options for mitochondria

I have a translation table of 4 for mitochondria, but when I create the EMBL file it says there is a conflict with the species translation table 1. How do I set the organelle to get around this?

organelle = self._verify( self.organelle, "organelle")
AttributeError: 'EMBL' object has no attribute 'organelle'

UnboundLocalError: local variable 'new_value' referenced before assignment

Hi,

in testing EMBLmyGFF3 to prepare an embl file with genome annotation for submission to EMBL, I'm getting the following error:

EMBLmyGFF3 scaffold1.gff scaffold1.fa --topology linear --transl_table 1 --molecule_type 'genomic DNA' --species 'Salix viminalis' --locus_tag TEST --project_id PRJEB00001 --de 'Single-molecule assembly' -o scaffold1.embl

    #############################################################################
    # NBIS 2018 - Sweden                                                        #
    # Authors: Martin Norling, Niclas Jareborg, Jacques Dainat                  #
    # Please visit https://github.com/NBISweden/EMBLmyGFF3 for more information #
    #############################################################################

12:25:01 WARNING feature: Unknown qualifier 'makerName' - skipped              ]
12:25:01 WARNING feature: Unknown qualifier '_QI' - skipped
12:25:01 WARNING feature: Unknown qualifier '_AED' - skipped
12:25:01 WARNING feature: Unknown qualifier '_eAED' - skipped
12:25:01 ERROR qualifier: local variable 'new_value' referenced before assignment
Traceback (most recent call last):
  File "/opt/pyenv/versions/2.7.10/envs/EMBLmyGFF3_venv/lib/python2.7/site-packages/EMBLmyGFF3/modules/qualifier.py", line 88, in _by_value_format
    formatted_value=new_value
UnboundLocalError: local variable 'new_value' referenced before assignment

I'm just using one scaffold for testing purposes. The head of the GFF file looks like this:

##gff-version 3
scaffold1	repeatmasker	match	2	581	896	+	.	ID=scaffold1:hit:709:1.3.0.0;Name=species:rnd-4_family-62|genus:Unspecified;Target=species:rnd-4_family-62|genus:Unspecified 253 760 +
scaffold1	maker	gene	444	3240	.	+	.	ID=salix_viminalisG00000000001;Name=at3g47200_37;makerName=genemark-scaffold1-processed-gene-0.6
scaffold1	maker	mRNA	444	3240	.	+	.	ID=salix_viminalisT00000000001;Parent=salix_viminalisG00000000001;Dbxref=PFAM:PF03140,InterPro:IPR004158;Name=at3g47200_37;_AED=0.25;_QI=0|0|0|0.25|1|1|4|0|412;_eAED=0.25;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1;product=UPF0481 protein At3g47200
scaffold1	maker	exon	444	618	.	+	.	ID=salix_viminalisE00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:1
scaffold1	maker	CDS	444	618	.	+	0	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds
scaffold1	maker	exon	827	912	.	+	.	ID=salix_viminalisE00000000002;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:2
scaffold1	maker	CDS	827	912	.	+	2	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds
scaffold1	maker	exon	2167	3072	.	+	.	ID=salix_viminalisE00000000003;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:3
scaffold1	maker	CDS	2167	3072	.	+	0	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds

Note that I have a local translation_gff_feature_to_embl_feature.json mapping "match" to "repeat_region":

 "match": {
 	"target": "repeat_region"
 },

The head of the fasta file is as follows:

>scaffold1
TAAAATAAAAAAAATCGGGTCGGGCCCAACTAAATGGGCCGGCTAGCCAAGATGGGCCAAAGCCCAATTTTAATGGGCTGGGCCGAGAGCCGTCCAGCCC
AAGACACGCAGAAGAGGAAAAAAAGAAAAGGGCAAAACGGCACTGTTTAGCACATGTTAATTAAACAGTTTACGTATTTCGTGAACAGTAAAATGGTGGT
CGGCCGACCACGACGAGAGGGTCACCTGCTATTGCCGCCGGGTAGAGGAGGTCGAGGTGGTTGTCCCTGTGGTTGTGGAGTCGAAAATGGTGGCCCGTGG
CGGCCGGAGGAGGCGTTGGAAGTGGCCGGTCTGTTGCTTGCTGTCGTCGGGGCTGTCACTGTTTCTTCGCCGGAGAGGACGACTGAGCTGCTGGAGTTGA
GGGGAGGCTGAAGGTGGTGATGAGGGTGGATATGGGTGGTTGAATGGTGGCTGTTGGAGGAGAGAGAGAGACGCCGGGTCCTCTGGTTTTAGAGAGAGAA
TGCTGTCGGGGAGAGAGAAAGGAGCTGCAACAGGCTGAGAGACGAAGGAGAGAGAGAGAGAAAGGGCTGCTGTGTCGCCGGAGCTGGAGAGGAAGAAAGG
GTGGCTGCCTCTGCGTGTGTATGCTTGTGTTCTGCAAATTTACCACGTCTTCGTCTTCCTCCTCCAGCCTTAATTTGAAACTGAAACTAAAATATTCGCC
TCTGTTCTCTCAAAACTTCTCAGTTTCTTCCTTGCTTTTCTTTGCCCAAATTTCTGTCGATTTTCCTCCCGTTTTTTCTCCCTTCTTTCTCCCCCTTCTG
CATGCATTTCATGCATGTATTTATAGGTTTGAAAGGCAACCCTTCAGCTGCCCATGGCGTGCAGCGAAGGGTTGCCGCCTGTGATTGCAGGTGGCGTGCC

Also, I installed EMBLmyGFF3 in a virtual environment, but when I run the maker example with the command EMBLmyGFF3-maker-example, I do get the correct output EMBLmyGFF3-maker-example.embl and without errors.

I can't seem to spot possible formatting errors in the input files but I'm just using this tool for the first time so there may be something that I'm missing.

Any help would be kindly appreciated.
Many thanks,
Pedro

Improve -i / --locus_tag option

The following points should be checked on the locus tag prefix given with option -i / --locus_tag (required when the locus-tag is registered at ENA):

A locus tag prefix must have the following format:

starts with a letter
is at least 3 characters long
is upper case
contains only alpha-numeric characters and no symbols such as -_*
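The four checks listed above could be sketched as follows. This is a standalone illustration with a hypothetical function name. It returns one message per violated rule so that the user sees everything that is wrong at once.

```python
import string

ALLOWED = string.ascii_letters + string.digits


def locus_tag_prefix_problems(prefix):
    """Return a list of rule violations; an empty list means the prefix passes."""
    problems = []
    if not prefix or prefix[0] not in string.ascii_letters:
        problems.append("must start with a letter")
    if len(prefix) < 3:
        problems.append("must be at least 3 characters long")
    if any(char in string.ascii_lowercase for char in prefix):
        problems.append("must be upper case")
    if any(char not in ALLOWED for char in prefix):
        problems.append("must contain only alphanumeric characters")
    return problems
```

For example, `CELE` passes, while `PRE_LOCUS` is rejected because of the underscore.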

ID not taken as locus tag

Hi, I'm excited to see something that may make things easier; it is a bit of a nightmare otherwise. I have IDs within my GFF and was expecting them to be used for the locus tags, but they are not: sequential numbers are used instead. A note is created with the ID, but I think it would be better if the locus_tag simply became the ID, as I think that is its purpose.

I don't have gene names in the GFF, but ideally a tab-separated file of gene names could be given to add these to the resulting EMBL file too, as this is the likely starting point for having gene names available. I would also like to parse the exon number after the : and add this in, although I don't think this is essential.

I'm still trying to work out the ENA format requirements for submission. I think I could just have a locus tag as the minimum feature, which is what I'm working towards. The Webin validation tool complains about overlapping UTR and CDS features of two genes in the same direction. Could a correction step be added to cleave the UTR and correct the gene when this is detected? Otherwise I have to work out how to fix this and start again. I know of a script somewhere that will do the cleaving of the UTR at least.

Sorry for the several change requests; otherwise I'll try to make the changes myself when I have time, but that is harder when you don't know the code.

FT mRNA join(433449..433533,433946..434073,434612..434836,
FT 435438..435904)
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001"
FT CDS join(433449..433533,433946..434073,434612..434836,
FT 435438..435710)
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001:cds"
FT /transl_table=1
FT exon 433449..433533
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001:1"

Joining of intron features

Currently EMBLmyGFF3 joins intron entries. Here is an example of joined introns:

FT   intron          join(6625..6675,6797..6841,6924..6966,7119..7161,
FT                   7245..7286,7423..7476,7630..7673,7750..7962,8110..8158,
FT                   8225..8265,8365..8407)
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"

As this does not make sense biologically, this issue should be fixed in later versions.
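A minimal sketch of the intended output, assuming each intron should become its own feature line instead of one `join(...)`. The function name is hypothetical, and the column layout follows the FT lines shown above (feature key at column 6, location at column 22).

```python
def intron_feature_lines(introns, strand="+"):
    """Format one FT line per intron from (start, end) pairs."""
    lines = []
    for start, end in sorted(introns):
        location = "%d..%d" % (start, end)
        if strand == "-":
            location = "complement(%s)" % location
        # "FT" + 3 spaces, key padded to 15 characters, one space, location.
        lines.append("FT   %-15s %s" % ("intron", location))
    return lines
```

Qualifiers such as /locus_tag would then be repeated under each line, which this sketch leaves out.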

bcbio not found

Hi,

I've followed the instructions to install EMBLmyGFF3 with git on Mac. I got the following error:

Download error on https://pypi.org/simple/bcbio-gff/: [Errno 54] Connection reset by peer -- Some packages may not be found! Couldn't find index page for 'bcbio-gff' (maybe misspelled?) Scanning index of all packages (this may take a while) Reading https://pypi.org/simple/ Download error on https://pypi.org/simple/: [Errno 54] Connection reset by peer -- Some packages may not be found! No local packages or working download links found for bcbio-gff==0.6.4 error: Could not find suitable distribution for Requirement.parse('bcbio-gff==0.6.4')

Does this mean I have to install bcbio myself before installing EMBLmyGFF3?

Cheers.

Features locations are duplicated - consider merging qualifiers

Thanks for this nice tool. I'm running into an issue trying to validate EMBL files that were generated with your tool. I'm using webin-cli-1.7.1 and it throws the error below when I try to validate/submit the EMBL files:

ERROR: "tRNA" Features locations are duplicated - consider merging qualifiers.

The command-line I used is this:

EMBLmyGFF3 test/6666666.419437.gff test/6666666.419437.contigs.fa -o test/test_new.embl

Any help in this regard would be highly appreciated

Bio.Alphabet has been removed from Biopython

Hi there,

I installed EMBLmyGFF3 via conda. When I try to run it, I get an error:

Traceback (most recent call last):
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/bin/EMBLmyGFF3", line 6, in <module>
    from EMBLmyGFF3 import main
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/__init__.py", line 3, in <module>
    from .EMBLmyGFF3 import *
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 4, in <module>
    from .modules.feature import Feature
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/modules/feature.py", line 10, in <module>
    from Bio.Alphabet.IUPAC import *
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/Bio/Alphabet/__init__.py", line 20, in <module>
    raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Thanks!
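For anyone patching a local copy, the approach suggested by the Biopython error message can be sketched as below. This is an assumption about how a fix could look, not the maintainers' change: the alphabet import becomes optional, and when it is unavailable a `molecule_type` annotation is prepared for the SeqRecord instead. `record_annotations` is a hypothetical helper.

```python
try:
    # Works only with Biopython releases that still ship Bio.Alphabet.
    from Bio.Alphabet import IUPAC
    HAVE_ALPHABET = True
except ImportError:
    # Newer Biopython (or no Biopython installed): carry on without alphabets.
    IUPAC = None
    HAVE_ALPHABET = False


def record_annotations(molecule_type="DNA"):
    """Annotations to attach to a SeqRecord when alphabets are unavailable."""
    return {} if HAVE_ALPHABET else {"molecule_type": molecule_type}
```

Pinning an older Biopython in the conda environment is the other obvious workaround, at the cost of staying on an old release.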

Why can't we use biopython > 1.67?

emblmygff3 1.2.3 has requirement biopython==1.67, but you'll have biopython 1.72 which is incompatible.

Feature qualifier values wrapped across multiple lines

Hello,

Thanks for making a very useful tool, I'm glad to have come across it.

There is an issue I'm having with line-wrapping. EMBLmyGFF3 appears to be wrapping lines at 80 characters. This creates problems with longer qualifier values, e.g. product names, because they are broken across several lines, often in the middle of a word. When one later runs the ENA flat file validator tool with the "fix" option, it unwraps the line but adds a space, so now the unwrapped name is broken.

Would it be possible to add an option to turn off line wrapping? Thank you!

-- Brandon
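One way to avoid the broken words, sketched here as an illustration rather than as the tool's behaviour, is to wrap qualifier values at whitespace so that re-joining the lines with a single space restores the original text. The constants follow the layout visible in the FT lines (21-character prefix, 80-character lines). The function name is hypothetical.

```python
import textwrap

FT_PREFIX = "FT" + " " * 19  # qualifier text starts at column 22
LINE_WIDTH = 80


def wrap_qualifier(name, value):
    """Wrap /name="value" at spaces; only over-long single words get split."""
    text = '/%s="%s"' % (name, value)
    chunks = textwrap.wrap(
        text,
        width=LINE_WIDTH - len(FT_PREFIX),
        break_long_words=True,
        break_on_hyphens=False,
    )
    return [FT_PREFIX + chunk for chunk in chunks]
```

This relies on the validator inserting a space when it unwraps lines, as described in the report above. Values with runs of several spaces would not round-trip exactly.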

Question: How to add more than 1 publication

Hi,

I need to add more than one publication. How would this be possible? I already tried to repeat the --ra, --rt, --rl parameters for each publication, but only the last one is used. Of course I could do it manually, but then the RP field won't be filled in automatically anymore and I would have to do it several times, which can be very time-consuming.

Thank you very much in advance.

Best Regards,
Nadine

Annotation description is missing in output file

Hi,
I really like the tool, but there is a small problem when I convert my .gff file to an .embl file: I cannot see the annotation information in the output file. The rest seems okay. I am including a subset of my data as text files with this message for your review. Can you please comment?

subset_embl.txt

annotated gff3.txt

--translate option does not work anymore

line 607 of feature.py has to be replaced by
translated_seq = str(seq.translate(codon_table)).replace('B','X').replace('Z','X').replace('J','X')

string index out of range (when the sequence ends with Ns)

Original question from @Iseez
Just one more question: when I was trying to obtain the EMBL file for a different species, I encountered the following error:

Traceback (most recent call last):                                             ]
  File "/cm/shared/apps/emblmygff3/1.2.6/bin/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.6', 'console_scripts', 'EMBLmyGFF3')()
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1383, in main
    writer.write_all( outfile )
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1179, in write_all
    self._add_mandatory()
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 195, in _add_mandatory
    if seq[end] == 'n' :
IndexError: string index out of range

Is the problem due to the files I'm using as input?
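The traceback shows `seq[end]` being read past the end of the sequence. Assuming the purpose of that code is to locate runs of N for gap features, a bounds-safe way to find them, including a run that reaches the very end of the sequence, could look like this. The function name is hypothetical and this is not the tool's code.

```python
import re


def gap_spans(seq):
    """Return 1-based inclusive (start, end) pairs for every run of N or n."""
    return [(match.start() + 1, match.end()) for match in re.finditer(r"[nN]+", seq)]
```

Because the regex never indexes beyond the string, a sequence ending in Ns is handled like any other.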

Installation inconveniences

Hello all,

thank you for your tool - very useful in dealing with EMBL/GBK/GFF3 formatting nightmare.

I've recommended it to several people who need to do submissions of annotated genomes. One thing that makes it hard to use is the fact that most people use bioconda now - thus both python and pip are installed via conda. This breaks EMBLmyGFF3 - neither installation command specified in the readme provides a working script. If you could make it easier to install in a cluster environment, it would be great.

Problems writing long sequences

For a 370 Mb chromosome arm, it is going to take me 4-5 days to convert the GFF3 to EMBL. The feature part was done in a few seconds, but the sequence ('SQ') part takes a very long time. Is this normal?

Writing short sequences of a few hundred kb is very fast, though. I wonder why?
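I cannot tell from the report where the time goes. A common cause of this pattern (fast for short sequences, very slow for long ones) is building the output through repeated string concatenation or repeated slicing of the remaining sequence, and that is only an assumption here. The sketch below formats SQ lines in a single linear pass with a generator. The function name is made up. The layout follows the SQ lines shown in other reports on this page (six blocks of ten bases, position right-aligned at column 80).

```python
def sq_lines(seq):
    """Yield 80-character EMBL sequence lines, 60 bases per line."""
    for offset in range(0, len(seq), 60):
        chunk = seq[offset:offset + 60]
        blocks = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        position = offset + len(chunk)
        # 5 leading spaces + 65 columns of blocks + 10 columns for the counter.
        yield "     " + blocks.ljust(65) + str(position).rjust(10)
```

Writing each yielded line straight to the output handle keeps both time and memory roughly proportional to the sequence length.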

When an error occurs, give information about location of error in GFF3 file

Thank you for this great tool! Very useful! and flexible enough to adapt to home attribute tags.
Please find below a suggestion for improvement.

Using this tool, I faced some errors with my big input GFF3 file (>300k features), e.g.:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 5: ordinal not in range(128)
ERROR feature: Stop codon found within the CDS. It will rise an error submiting the data to ENA. Please fix your gff file.
Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.

And it was not easy to find out what character or what CDS was the wrong one in the GFF3 file. I think it would be great to provide information about the location of the error (e.g. line number, feature ID).
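For the `UnicodeDecodeError` case in particular, the offending characters can be located before conversion with a few lines of Python. This is a standalone sketch (the function name is made up) that reads the file as bytes and reports the line number and column of every non-ASCII byte.

```python
def find_non_ascii(lines):
    """Yield (line_number, column, byte_value) for bytes above 127; positions are 1-based.

    `lines` is any iterable of bytes objects, e.g. a file opened with mode "rb".
    """
    for line_number, raw in enumerate(lines, start=1):
        for column, byte in enumerate(raw, start=1):
            if byte > 127:
                yield line_number, column, byte
```

Typical use would be `with open("annotations.gff3", "rb") as handle:` followed by a loop over `find_non_ascii(handle)`, where the file name is a placeholder.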

ZeroDivisionError (when --no_progress activated)

Hello

I am trying to use EMBLmyGFF3 and I get the following error:

Traceback (most recent call last):
  File "/scratch/OSR/bin/EMBLmyGFF3/scripts/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.6', 'console_scripts', 'EMBLmyGFF3')()
  File "/home/psur9757/.local/lib/python2.7/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1386, in main
    EMBL.print_progress(True)
  File "/home/psur9757/.local/lib/python2.7/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 386, in print_progress
    progress = "[" + "="* int(78 * (float(EMBL.progress)/EMBL.total_features))
ZeroDivisionError: float division by zero

I checked the biopython and bcbio-gff versions

$ python --version
Python 2.7.9
$ python -c "import Bio; from BCBio import GFF; print 'biopython version: '+Bio.version; print 'bcbio-gff version: '+GFF.version"
biopython version: 1.67
bcbio-gff version: 0.6.4

My command:

samtools faidx $genome $scaffold -o $FASTAs/${scaffold}.fa
grep "^$scaffold" $gff > $GFFs/${scaffold}.gff3
$project/EMBLmyGFF3 --shame --no_progress --ra $AUTHOR --rg $REFERENCE_GROUP -i $LOCUS_TAG -p $PROJECT -m "$MOLECULE" -r $TABLE -t linear -s "$SPECIES" -x $TAXONOMY -o $EMBLs/${scaffold}.embl $GFFs/${scaffold}.gff3 $FASTAs/${scaffold}.fa
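The traceback above shows the division failing because `EMBL.total_features` is zero when `print_progress` runs. Why it is zero is not visible from the report; the issue title suggests the counter is not filled when `--no_progress` is set. Either way, a guarded version of the bar computation avoids the crash. The function below is a hypothetical standalone sketch, not the tool's code.

```python
def progress_bar(done, total, width=78):
    """Render a fixed-width text bar; an unknown or zero total renders as full."""
    fraction = float(done) / total if total > 0 else 1.0
    fraction = min(max(fraction, 0.0), 1.0)  # clamp against over- or under-counting
    filled = int(width * fraction)
    return "[" + "=" * filled + " " * (width - filled) + "]"
```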

EMBL flatfile not compatible with PAGIT RATT

Thanks for the conversion software. I have a query that might be a bit off track.

I am trying to use the output .embl (1.31 GB) from augustus gff3 with RATT software (run with linux) and get this error:

I am using the reference.fa. Please make sure that the description line of each fasta entry is the same than in the embl file name!

Just wonder if this is a known issue?

many thanks!

Download

I've tried downloading the program with all 3 options but I get errors every time I try to download it or run it, and I'm not sure what is wrong. Is it because I have the updated version of Python?

Traceback (most recent call last):
  File "/Users/chengh1/miniconda3/bin/EMBLmyGFF3", line 33, in <module>
    sys.exit(load_entry_point('EMBLmyGFF3==2', 'console_scripts', 'EMBLmyGFF3')())
  File "/Users/chengh1/miniconda3/bin/EMBLmyGFF3", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/Users/chengh1/miniconda3/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/Users/chengh1/miniconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 1014, in _gcd_import
  File "", line 991, in _find_and_load
  File "", line 975, in _find_and_load_unlocked
  File "", line 671, in _load_unlocked
  File "", line 783, in exec_module
  File "", line 219, in _call_with_frames_removed
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/__init__.py", line 3, in <module>
    from .EMBLmyGFF3 import *
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 4, in <module>
    from .modules.feature import Feature
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/modules/feature.py", line 8, in <module>
    from Bio.Seq import Seq
ModuleNotFoundError: No module named 'Bio'

Issue with translating genes on complement strand

Hello,

I've come across an issue with how CDS features are printed for genes encoded on the complementary strand. The problem manifests itself clearly when using the --translate flag, as it produces lots of erroneous translations riddled with stop codons *.

I give an example below.

The EMBL output for an affected gene looks like:

FT   gene            complement(123273..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.g00007"
FT   mRNA            complement(join(128366..128445,126919..127115,124188..124
FT                   406,123273..123484))
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007"
FT   exon            complement(128366..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E1"
FT   exon            complement(126919..127115)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E2"
FT   exon            complement(124188..124406)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E3"
FT   exon            complement(123273..123484)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E4"
FT   CDS             complement(join(<128366..128445,126919..127115,124188..12
FT                   4406,123273..>123484))
FT                   /locus_tag="BANY_locus6"
FT                   /codon_start=1
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-CDS"
FT                   /translation="QKFI*SNIWC*HLVIRS*TTNALTLVCVTFSACRRGSSIRCRVVS
FT                   LHVAAALSSRAMEIPPRAMTTPL*VSS*QTNMDRE*RASNDRHTVVQRNVWRTCEDRKI
FT                   DS*RRNSNRKRLSV*GRCR*CCF*MWFR*L**MGSSYKL*FGEKCEIIKISKPIKSHWA
FT                   KENNLNLNELLSDGEYKELYRLAMIKWSEDMREKDYGCFCRAACENDVSTSNFTVQR*E
FT                   KVWQRFFN*SLKRK"
FT                   /transl_table=1

The mRNA feature looks fine, but there are some puzzling < and > characters in the CDS feature that I think may be the problem. The translation is then subsequently messed up, and in fact appears to be the translation for the exons in reverse order, as QKFI* corresponds to the first 4 "codons" of the last exon (E4, 123273..123484).

Hopefully an easy issue, and thanks for a great tool, this is going to be extremely useful :-)

Or maybe there is something funny in the GFF? The entry for this gene is:

BANY00001       GenomeHubs      gene    123273  128445  .       -       .       ID=BANY.1.2.g00007
BANY00001       GenomeHubs      mRNA    123273  128445  .       -       .       ID=BANY.1.2.t00007;Parent=BANY.1.2.g00007
BANY00001       GenomeHubs      exon    128366  128445  .       -       .       ID=BANY.1.2.t00007-E1;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    126919  127115  .       -       .       ID=BANY.1.2.t00007-E2;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    124188  124406  .       -       .       ID=BANY.1.2.t00007-E3;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    123273  123484  .       -       .       ID=BANY.1.2.t00007-E4;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     128366  128445  .       -       0       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     126919  127115  .       -       2       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     124188  124406  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     123273  123484  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007

Running biopython version: 1.67 and bcbio-gff version: 0.6.4
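
The suspected exon-order problem can be illustrated with a minimal sketch. It is not the tool's actual code: `reverse_complement`, `spliced_cds` and the toy `genome` are hypothetical names and data chosen for illustration. For a minus-strand CDS, the segments are joined in ascending genomic order and the whole string is reverse-complemented once. Reverse-complementing each segment and joining them in ascending order instead puts the last exon first, which would match the symptom described above.

```python
# Hypothetical helpers; a 20-bp toy sequence stands in for the real scaffold.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_cds(genome, segments, strand):
    """Build the coding sequence from 1-based inclusive (start, end) segments."""
    ordered = sorted(segments)  # ascending genomic order
    joined = "".join(genome[start - 1:end] for start, end in ordered)
    # On the minus strand, reverse-complement the joined sequence once.
    return reverse_complement(joined) if strand == "-" else joined


genome = "CCTTATTTGGGGGGCCATCC"
segments = [(13, 18), (3, 8)]  # first exon of a minus-strand gene has the higher coordinates

correct = spliced_cds(genome, segments, "-")

# Suspected failure mode: per-segment reverse complement, joined in ascending order.
wrong = "".join(reverse_complement(genome[s - 1:e]) for s, e in sorted(segments))

print(correct)  # ATGGCCAAATAA
print(wrong)    # AAATAAATGGCC
```

In the toy example the correct sequence starts with ATG and ends with TAA, while the wrongly ordered one starts with the codons of the last exon.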

How to avoid duplicated db_xref entries in CDS?

Hi,
I encountered multiple, identical db_xref entries for CDS features with more than one exon. How can I avoid this? Do I really have to limit the db_xref information to one of the multiple CDS entries of a feature in the GFF file?

Thanks in advance
Best regards
Nadine
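
One possible workaround, not confirmed by the maintainers, is to pre-process the GFF so that only the first row of each multi-row CDS keeps its cross-references. The sketch below assumes the cross-references sit in the standard GFF3 `Dbxref` attribute and that all rows of one CDS share the same `ID`; `keep_first_dbxref` is a hypothetical helper name.

```python
def keep_first_dbxref(gff_lines):
    """Drop Dbxref from every CDS row except the first one seen for each CDS ID."""
    seen = set()
    out = []
    for line in gff_lines:
        stripped = line.rstrip("\n")
        cols = stripped.split("\t")
        # Pass comments, malformed rows and non-CDS features through unchanged.
        if stripped.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            out.append(stripped)
            continue
        attrs = [a for a in cols[8].split(";") if a]
        cds_id = next((a[3:] for a in attrs if a.startswith("ID=")), None)
        key = (cols[0], cds_id)
        if cds_id is not None and key in seen:
            attrs = [a for a in attrs if not a.startswith("Dbxref=")]
        seen.add(key)
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out


rows = [
    "chr1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=cds1;Parent=m1;Dbxref=UniProtKB:P12345",
    "chr1\tsrc\tCDS\t20\t29\t.\t+\t0\tID=cds1;Parent=m1;Dbxref=UniProtKB:P12345",
]
for row in keep_first_dbxref(rows):
    print(row)
```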

Improve performance (EMBLmyGFF3 v3)

A user ran the tool on a huge GFF annotation of 3.34 GB. It took about 24 h of computation and apparently used more than 50 GB of memory.
We should investigate how to optimise speed and memory usage.

Thread mentioning this here.

Bio.Alphabet issue

Hi,

Thanks for this software, I'm looking forward to using it. I've just installed it via conda, but I am having some issues that relate to an update of biopython. Can you tell me the versions of biopython and bcbio-gff that you were able to run the software on?

The install I ran today is running into issues with Bio.Alphabet. See the error:

    "Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Here is my detailed conda environment information.

channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - bcbio-gff=0.6.6=pyh864c0ab_1
  - biopython=1.78=py37h8f50634_0
  - bx-python=0.8.9=py37h73d7ac5_2
  - ca-certificates=2020.6.20=hecda079_0
  - certifi=2020.6.20=py37hc8dfbb8_0
  - emblmygff3=2=py_0
  - ld_impl_linux-64=2.35=h769bd43_9
  - libblas=3.8.0=17_openblas
  - libcblas=3.8.0=17_openblas
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.3.0=h5dbcf3e_17
  - libgfortran-ng=7.5.0=hae1eefd_17
  - libgfortran4=7.5.0=hae1eefd_17
  - libgomp=9.3.0=h5dbcf3e_17
  - liblapack=3.8.0=17_openblas
  - libopenblas=0.3.10=pthreads_hb3c22a3_4
  - libstdcxx-ng=9.3.0=h2ae2ef3_17
  - lzo=2.10=h516909a_1000
  - ncurses=6.2=he1b5a44_1
  - numpy=1.16.4=py37h95a1406_0
  - openssl=1.1.1h=h516909a_0
  - pip=20.2.3=py_0
  - python=3.7.8=h6f2ec95_1_cpython
  - python-lzo=1.12=py37h81344f2_1001
  - python_abi=3.7=1_cp37m
  - readline=8.0=he28a2e2_2
  - setuptools=49.6.0=py37hc8dfbb8_1
  - six=1.15.0=pyh9f0ad1d_0
  - sqlite=3.33.0=h4cf870e_1
  - tk=8.6.10=hed695b0_1
  - wheel=0.35.1=pyh9f0ad1d_0
  - xz=5.2.5=h516909a_1
  - zlib=1.2.11=h516909a_1009
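
For context, `Bio.Alphabet` was removed in Biopython 1.78, which is the version listed in the environment above. The check below is a small sketch with a hypothetical helper name, `has_bio_alphabet`. It only compares version strings and does not import Biopython.

```python
def has_bio_alphabet(biopython_version):
    """Return True for Biopython releases older than 1.78, which still ship Bio.Alphabet."""
    major, minor = (int(part) for part in biopython_version.split(".")[:2])
    return (major, minor) < (1, 78)


for version in ("1.67", "1.77", "1.78"):
    print(version, has_bio_alphabet(version))
```

Under that assumption, pinning the dependency to a release older than 1.78 (for example a `biopython<1.78` constraint in the conda environment) may be enough for emblmygff3 version 2 to import again. This has not been verified here.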

Error with embl file during ENA webin-cli validation

Hi there,

I was trying to validate the embl files from EMBLmyGFF3 on ENA's webin-cli but I got the below error message.

ERROR: The qualifier "isolation_source" must exist when qualifier "environmental_sample" exists within the same feature.

I used taxid: 77133 (uncultured bacterium) at the time of running EMBLmyGFF3. To give you more of a background, I'm trying to submit a genome annotation file of an uncultivated bacterium. Any help here would be very much appreciated. Thanks in advance for your excellent support (as always!).
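
To locate the affected features before running the validator again, the flat file can be scanned for features that carry /environmental_sample without /isolation_source. This is a rough sketch with a hypothetical function name. It assumes the standard EMBL layout (feature keys start in column 6, qualifiers start with "/") and may misread a wrapped qualifier value that happens to begin with "/".

```python
def features_missing_isolation_source(embl_text):
    """List feature keys that have /environmental_sample but no /isolation_source."""
    features = []  # one (feature_key, qualifier_names) pair per feature
    for line in embl_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[5:]
        if body and not body[0].isspace():
            # A new feature: the key starts in column 6.
            features.append((body.split()[0], set()))
        elif features and body.strip().startswith("/"):
            name = body.strip()[1:].split("=", 1)[0]
            features[-1][1].add(name)
    return [
        key
        for key, qualifiers in features
        if "environmental_sample" in qualifiers and "isolation_source" not in qualifiers
    ]


sample = (
    'FT   source          1..100\n'
    'FT                   /organism="uncultured bacterium"\n'
    'FT                   /environmental_sample\n'
    'FT   gene            1..50\n'
    'FT                   /locus_tag="X_1"\n'
)
print(features_missing_isolation_source(sample))  # ['source']
```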

Find a way to make the json file more easily accessible

Since we wrapped the tool up as a Python module to ease installation and to make sure that people use the correct versions of the dependencies, the JSON mapping files have become less easily accessible.
We could require them to be present in the directory where the tool is launched. If they are missing, we copy the default JSON files there and use them.
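
The proposal could look roughly like the sketch below. `ensure_local_mappings` and the directory arguments are hypothetical names, not existing parts of the tool. Existing local files are left untouched so that user edits are preserved.

```python
import shutil
from pathlib import Path


def ensure_local_mappings(default_dir, work_dir):
    """Copy default *.json mapping files into work_dir unless a local copy exists."""
    default_dir, work_dir = Path(default_dir), Path(work_dir)
    copied = []
    for src in sorted(default_dir.glob("*.json")):
        dest = work_dir / src.name
        if not dest.exists():  # never overwrite a file the user may have edited
            shutil.copy(src, dest)
            copied.append(dest.name)
    return copied
```

The tool would then always read the mapping files from the launch directory, and the returned list could be used to tell the user which defaults were just created.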

How to add in comment or CC line

How should you input text that you want to be incorporated into the CC comments or notes line?

EMBL to fasta ?

Hi,

Thanks for the script, it works well so far. During the process some of my scaffolds were discarded (because they are too short for EBI).
Do you have a reverse script to convert EMBL to FASTA so that I can recompute some statistics (N50 etc.), or do you know a script that can compute such statistics directly from the EMBL format?

Best,
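
For the conversion itself, Biopython's `SeqIO.convert(in_file, "embl", out_file, "fasta")` should work, provided the file parses cleanly. For quick statistics, a dependency-free sketch such as the one below may be enough. The function names are hypothetical, and the parser only assumes that each record has an `SQ` line followed by sequence lines and a closing `//`.

```python
def embl_sequence_lengths(embl_text):
    """Return the sequence length of each record, in file order."""
    lengths = []
    in_sequence, count = False, 0
    for line in embl_text.splitlines():
        if line.startswith("SQ "):
            in_sequence, count = True, 0
        elif line.startswith("//"):
            if in_sequence:
                lengths.append(count)
            in_sequence = False
        elif in_sequence:
            # Sequence lines hold bases, spaces and a trailing position counter.
            count += sum(ch.isalpha() for ch in line)
    return lengths


def n50(lengths):
    """Length of the shortest sequence needed to cover half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


example = (
    "ID   XXX; XXX; linear; genomic DNA; XXX; XXX; 12 BP.\n"
    "SQ   Sequence 12 BP;\n"
    "     acgtacgtac gt                                                            12\n"
    "//\n"
    "ID   XXX; XXX; linear; genomic DNA; XXX; XXX; 4 BP.\n"
    "SQ   Sequence 4 BP;\n"
    "     acgt                                                                      4\n"
    "//\n"
)
lengths = embl_sequence_lengths(example)
print(lengths, n50(lengths))
```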

Accession number same as contig name.

Hi,
I am trying to use this tool to make an EMBL file for upload to IMG, but I am having issues with the output. First, is there a way to use the contig name as the ID? IMG does not accept files with XXX as the ID.
Thanks

db_xref: tool crashes when one db is not accepted

We should simply skip db_xref entries that are not accepted instead of crashing.
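
A sketch of the intended behaviour, assuming each db_xref has the form `database:identifier` and that the accepted databases are available as a set. `filter_db_xrefs` and its arguments are hypothetical names, not the tool's current API.

```python
import logging


def filter_db_xrefs(db_xrefs, accepted_databases):
    """Keep only db_xref values whose database prefix is accepted; warn about the rest."""
    kept = []
    for xref in db_xrefs:
        database = xref.split(":", 1)[0]
        if database in accepted_databases:
            kept.append(xref)
        else:
            logging.warning("Skipping db_xref %r: database %r is not accepted", xref, database)
    return kept


print(filter_db_xrefs(["UniProtKB/Swiss-Prot:P12345", "MyLocalDB:42"], {"UniProtKB/Swiss-Prot"}))
```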

option to flag or remove gene models with short exon (<10 nt)

mRNAs with short introns (<10 bp) are not accepted for submission. It would be nice to catch those cases. They should be easy to find by looking at the list of coordinates of the mRNA features.
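
The check could be sketched as follows, working from the exon coordinates of one mRNA (1-based, inclusive). `short_introns` is a hypothetical helper name and the 10 bp threshold is taken from the description above.

```python
def short_introns(exons, min_length=10):
    """Return (start, end, length) for each intron shorter than min_length."""
    ordered = sorted(exons)
    flagged = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        length = next_start - prev_end - 1  # bases strictly between the two exons
        if 0 < length < min_length:
            flagged.append((prev_end + 1, next_start - 1, length))
    return flagged


print(short_introns([(1, 100), (105, 200), (300, 400)]))  # [(101, 104, 4)]
```

Flagged transcripts could then be either reported to the user or removed, depending on the option chosen.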

Webin-CLI validation failing due to duplicated feature locations in EMBLmyGFF3 flat file

I am getting the following errors when I try to validate my EMBLmyGFF3-generated flat file through Webin-CLI.

head genome/Pmacd_v0.10/validate/Pmacd_v0.10_ENAsubmit.embl.gz.report
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6951 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 6921 of Pmacd_v0.10_ENAsubmit.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 7321 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 7283 of Pmacd_v0.10_ENAsubmit.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 13785 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 13757 of Pmacd_v0.10_ENAsubmit.embl.gz]

My gff has the format:

Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        gene    14395   28338   .       -       .       ID=PmacdG00000006135
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        transcript      14395   28338   .       -       .       ID=PmacdG00000006135.1;Parent=PmacdG00000006135
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    14395   14538   .       -       .       ID=PmacdG00000006135.1-exon1;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    25250   25354   .       -       .       ID=PmacdG00000006135.1-exon2;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    28297   28338   .       -       .       ID=PmacdG00000006135.1-exon3;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        CDS     14395   14538   .       -       0       ID=PmacdG00000006135.1-cds1;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        CDS     25250   25354   .       -       0       ID=PmacdG00000006135.1-cds1;Parent=PmacdG00000006135.1

I ran EMBLmyGFF3 using the command:

/home/racste/.local/bin/EMBLmyGFF3  Pmacd_v0.10_braker_gffread_merge_mod_nogeneID.gff Pmacd_v0.10.fasta \
        --topology linear \
        --molecule_type 'genomic DNA' \
        --transl_table 1  \
        --species 'Pieris macdunnoughii' \
        --project_id PRJEB42400 \
        -o result.embl \
        -locus_tag PMACD

Of course my first thought was to remove the duplicates with agat_sp_fix_features_locations_duplicated.pl, but there were no duplicates detected:

=> OmniscientI total time: 198 seconds
Pmacd_v0.10_braker_gffread_merge_mod_nogeneID.gff file parsed

We found 0 cases where isoforms have identical exon structures (we removed duplicates by keeping the one with longest CDS).
We found 0 cases where l2 from different gene identifier have identical exon but no CDS at all (we removed one duplicate).
We found 0 cases where l2 from different gene identifier have identical exon and CDS structures (we removed duplicates by keeping the one with longest CDS).
We found 0 cases where l2 from different gene identifier have identical exon structures (we reshaped UTRs to modify gene locations).
Whe removed 0 genes because no more l2 were linked to them.
We found 0 cases where 2 genes have same location while CDS are differents. In that case we modified the gene locations by clipping UTRs.

Here's an example of one of the overlapping features from all three files:

# webin-CLI report
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6951 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 6921 of Pmacd_v0.10_ENAsubmit.embl.gz]

# EMBLmyGFF3 output
#line: 6921-6924 of Pmacd_v0.10_ENAsubmit.embl.gz
FT   exon            complement(2756392..2756483)
FT                   /locus_tag="PMACD_LOCUS154"
FT                   /note="ID:PmacdG00000009802.2-exon5"
FT                   /note="source:AUGUSTUS"

#line: 6951-6954 of Pmacd_v0.10_ENAsubmit.embl.gz
FT   exon            complement(2756392..2756483)
FT                   /locus_tag="PMACD_LOCUS155"
FT                   /note="ID:PmacdG00000009803.1-exon2"
FT                   /note="source:GeneMark.hmm"

# gff3 input
# relevant exon is starred (**)
grep PmacdG00000009802 *nogeneID.gff
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    gene    2751271 2756483 .       -       .       ID=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    transcript      2751271 2752952 .       -       .       ID=PmacdG00000009802.1;Parent=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2751271 2751537 0       -       .       ID=PmacdG00000009802.1-exon1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2752178 2752342 0       -       .       ID=PmacdG00000009802.1-exon2;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2752767 2752952 0       -       .       ID=PmacdG00000009802.1-exon3;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2751271 2751537 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2752178 2752342 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2752767 2752952 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        transcript      2751271 2756483 .       -       .       ID=PmacdG00000009802.2;Parent=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2751271 2751537 .       -       .       ID=PmacdG00000009802.2-exon1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2752178 2752342 .       -       .       ID=PmacdG00000009802.2-exon2;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2752767 2752930 .       -       .       ID=PmacdG00000009802.2-exon3;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2755781 2755980 .       -       .       ID=PmacdG00000009802.2-exon4;Parent=PmacdG00000009802.2
** Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2756392 2756483 .       -       .       ID=PmacdG00000009802.2-exon5;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2751271 2751537 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2752178 2752342 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2752767 2752930 .       -       2       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2755781 2755980 .       -       1       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2756392 2756483 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2

# relevant exon is starred (**)
grep PmacdG00000009803 *nogeneID.gff
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    gene    2755740 2756483 .       -       .       ID=PmacdG00000009803
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    transcript      2755740 2756483 .       -       .       ID=PmacdG00000009803.1;Parent=PmacdG00000009803
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2755740 2755980 0       -       .       ID=PmacdG00000009803.1-exon1;Parent=PmacdG00000009803.1
** Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2756392 2756483 0       -       .       ID=PmacdG00000009803.1-exon2;Parent=PmacdG00000009803.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2755740 2755980 .       -       1       ID=PmacdG00000009803.1-cds1;Parent=PmacdG00000009803.1 
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2756392 2756483 .       -       0       ID=PmacdG00000009803.1-cds1;Parent=PmacdG00000009803.1

Do you have any other suggestions of how to fix this error so I can validate and submit my flat file?

Thanks!
Rachel
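
To list every such case in the input, a small script can group exon rows by location. The sketch below assumes tab-separated GFF3 rows and uses a hypothetical function name. It reports any location shared by more than one exon, including exons shared between isoforms of the same gene. In the example above the two transcripts share only one exon, which may be why a comparison of whole exon structures finds no duplicates.

```python
from collections import defaultdict


def duplicated_exon_locations(gff_lines):
    """Map (seqid, start, end, strand) to exon IDs for locations used more than once."""
    by_location = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(a.split("=", 1) for a in cols[8].split(";") if "=" in a)
        location = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        by_location[location].append(attrs.get("ID", "?"))
    return {loc: ids for loc, ids in by_location.items() if len(ids) > 1}


rows = [
    "Pmacd_v0.10_Sc0000000_pilon\tAUGUSTUS\texon\t2756392\t2756483\t.\t-\t.\t"
    "ID=PmacdG00000009802.2-exon5;Parent=PmacdG00000009802.2",
    "Pmacd_v0.10_Sc0000000_pilon\tGeneMark.hmm\texon\t2756392\t2756483\t0\t-\t.\t"
    "ID=PmacdG00000009803.1-exon2;Parent=PmacdG00000009803.1",
    "Pmacd_v0.10_Sc0000000_pilon\tGeneMark.hmm\texon\t2755740\t2755980\t0\t-\t.\t"
    "ID=PmacdG00000009803.1-exon1;Parent=PmacdG00000009803.1",
]
for location, ids in duplicated_exon_locations(rows).items():
    print(location, ids)
```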
