An efficient way to convert gff3 annotation files into EMBL format ready to submit.

License: GNU General Public License v3.0

emblmygff3's Introduction

EMBLmyGFF3

GFF3 to EMBL conversion tool

EMBLmyGFF3 converts an assembly in FASTA format along with associated annotation in GFF3 format into the EMBL flat file format which is the required format for submitting annotated assemblies to ENA.

[ Similarly, to prepare your data for submission to NCBI, please use Genome Annotation Generator - GAG. ]

! NCBI and ENA are part of INSDC and their data are synchronised every day, so everything submitted to one of these databases will also be accessible in the other.

Based on documentation from:

http://www.insdc.org/files/feature_table.html
http://www.ebi.ac.uk/ena/WebFeat/
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/FT_current.html#7.1.1
ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt.

Don't know how to submit to ENA? Please visit the ENA: Guidelines and Tips.

Prerequisites

Python >=3.9, biopython >=1.78, numpy >=1.22 and the bcbio-gff >=0.6.4 python packages.

In order to install pip please use the following steps:

Mac OS X / LINUX:

Install pip the python package manager:

sudo easy_install pip

biopython and bcbio-gff will be installed automatically with the next steps

Installation

Installation with conda:

conda install -c bioconda emblmygff3

Installation with pip:

pip install git+https://github.com/NBISweden/EMBLmyGFF3.git

or, if you do not have administrative rights on your machine:

pip install --user git+https://github.com/NBISweden/EMBLmyGFF3.git

Installation with git:

Clone the repository:

git clone https://github.com/NBISweden/EMBLmyGFF3.git

Move into the folder:

cd EMBLmyGFF3/

Install:

python setup.py install

or, if you do not have administrative rights on your machine:

python setup.py install --user

Check installation

Executing either of the following commands:

EMBLmyGFF3

EMBLmyGFF3 -h

will display some help.
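
The same check can be scripted. The following is only a minimal sketch, assuming that the installation places an executable named EMBLmyGFF3 on your PATH (as the commands above imply); the helper name find_tool is hypothetical and not part of the tool.

```python
import shutil


def find_tool(name="EMBLmyGFF3"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("EMBLmyGFF3 not found on PATH; check the installation.")
    else:
        print("EMBLmyGFF3 found at", path)
```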

Update

Update with pip:

pip install git+https://github.com/NBISweden/EMBLmyGFF3.git --upgrade

or, if you do not have administrative rights on your machine:

pip install --user git+https://github.com/NBISweden/EMBLmyGFF3.git --upgrade

Update with git:

Move into the repository folder and execute:

git pull

Uninstall

pip uninstall EMBLmyGFF3

Usage

Foreword

The mandatory input files are a correct GFF3 file and the genome in FASTA format that was used to produce it. To get a valid EMBL flat file suitable for submission, you must then carefully fill in all mandatory metadata.

/!\ Please be aware that a project ID and a locus tag are mandatory for a submission to ENA. You don't need this information if you don't plan to submit the data (e.g. if you just want an EMBL-like flat file for other purposes). If you don't have this information yet, you can add it later by replacing the corresponding fields.
To learn how to obtain a project ID click here.
To learn how to obtain a locus tag click here.

Use provided examples

Three examples are provided with the tool and are located in the examples folder. You can try any of them by calling EMBLmyGFF3-maker-example, EMBLmyGFF3-augustus-example or EMBLmyGFF3-prokka-example. This is convenient when you installed the tool using pip.

EMBLmyGFF3-maker-example

If you installed EMBLmyGFF3 using git, the example files are located where you cloned the repository, in EMBLmyGFF3/examples/. You can also access these files by downloading the examples folder here. You can then try the examples by moving into the examples folder and launching one of the .py or .sh executables like this:

python maker_example.py

./maker_example.sh

Simple case

EMBLmyGFF3 maker.gff3 maker.fa

This will prompt you to fill in, one by one, the mandatory information needed to produce a proper EMBL file. Once the software has all the information it needs, it will process the input files and print the result to STDOUT.

In order to write the result to the desired file, use the -o option:

EMBLmyGFF3 maker.gff3 maker.fa -o result.embl

Complete case

Minimum set of options needed to launch the software without any prompt.

EMBLmyGFF3 maker.gff3 maker.fa \
        --topology linear \
        --molecule_type 'genomic DNA' \
        --transl_table 1  \
        --species 'Drosophila melanogaster' \
        --locus_tag LOCUSTAG \
        --project_id PRJXXXXXXX \
        -o result.embl

Advanced case 1

Adding more information than the mandatory fields (filling the ID line).

EMBLmyGFF3 maker.gff3 maker.fa \
        --data_class STD \
        --topology linear \
        --molecule_type "genomic DNA" \
        --transl_table 1  \
        --species 'Drosophila melanogaster' \
        --taxonomy INV \
        --locus_tag LOCUSTAG \
        --project_id PRJXXXXXXX \
        --rg MYGROUP \
        -o result.embl

Advanced case 2

Adding more information than the mandatory fields (filling the ID line plus publication and author information).

EMBLmyGFF3 maker.gff3 maker.fa \
        --data_class STD \
        --topology linear \
        --molecule_type "genomic DNA" \
        --transl_table 1  \
        --species 'Drosophila melanogaster' \
        --taxonomy INV \
        --locus_tag LOCUSTAG \
        --project_id PRJXXXXXXX \
        --rg MYGROUP \
        --author 'author for the reference' \
        --rt 'reference title' \
        --rl 'Some journal' \
        -o result.embl

Use through a script

You may prefer to launch the software through a script, especially when you want to provide a lot of information, so we provide examples of such scripts in bash (.sh) and python (.py) in the examples folder.
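
As an illustration of the idea, here is a minimal sketch of such a wrapper. It is not one of the scripts shipped in the examples folder: the helper build_command is hypothetical, and it simply assumes that each long option documented in this README (for example --topology or --locus_tag) takes a single value. The option values are the placeholders used in the cases above.

```python
import shlex
import subprocess


def build_command(gff, fasta, output, **options):
    """Build the EMBLmyGFF3 argument list from keyword options.

    Each keyword becomes a long option, e.g. locus_tag="ABC" -> --locus_tag ABC.
    """
    cmd = ["EMBLmyGFF3", gff, fasta]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "maker.gff3", "maker.fa", "result.embl",
        topology="linear",
        molecule_type="genomic DNA",
        transl_table=1,
        species="Drosophila melanogaster",
        locus_tag="LOCUSTAG",      # placeholder, replace with your registered prefix
        project_id="PRJXXXXXXX",   # placeholder, replace with your project ID
    )
    # Show the command with shell quoting so it can be copied into a terminal.
    print(shlex.join(cmd))
    # Uncomment to run the conversion (requires EMBLmyGFF3 to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with values that contain spaces, such as the species name.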

Parameters

Some parameters are mandatory and others are not. Here is a list of all available parameters. You can also find comprehensive help about the different parameters using the -h or --help option, and more advanced help using --ah X or --advanced_help X, where X is the parameter you would like to learn more about.

positional arguments:

Parameter	Description
gff_file	Input gff-file.
fasta	Input fasta sequence.

Mandatory arguments related to the EMBL format, to check carefully:

Parameter	Description
-i, --locus_tag	Locus tag. Used to set up the prefix of the locus_tag qualifier. It has to be registered at ENA prior to any submission. More information here. The default is XXX.
-p, --project_id	Project ID. Default is 'XXX' (This is used to set up the PR line).
-r, --transl_table	Translation table. No default. (This is used to set up the translation table qualifier transl_table of the CDS features.) Please visit NCBI genetic code for more information.
-s, --species	Sample species, formatted as 'Genus species' or taxid. No default. (This is used to set up the OS line.)
-t, --topology	Sequence topology. No default. (This is used to set up the Topology that is the 3rd token of the ID line.)
-m, --molecule_type	Molecule type of the sample. No default value.

Optional arguments related to the EMBL format:

Parameter	Description
-a , --accession	Boolean. Accession number(s) for the entry. Default value: XXX. The proper value is automatically filled in by ENA during the submission with a unique accession number they will assign. The accession number is used to set up the AC line and the first token of the ID line as well. Please visit this page and this one to learn more about it. Activating the option will set the accession number to the fasta sequence identifier.
-c , --created	Creation time of the original entry. The default value is the date of the day.
-d , --data_class	Data class of the sample. Default value 'XXX'. This option is used to set up the 5th token of the ID line.
-g , --organelle	Sample organelle. No default value.
-k , --keyword	Keywords for the entry. No default value.
-l , --classification	Organism classification e.g 'Eukaryota; Opisthokonta; Metazoa'. The default value is the classification found in the NCBI taxonomy DB from the species/taxid given as --species parameter. If none is found, 'Life' will be the default value.
-x , --taxonomy	Source taxonomy. Default value 'XXX'. This option is used to set the taxonomic division within ID line (6th token).
--de	Description. Default value 'XXX'.
--ra , --author	Author for the reference. No default value.
--rc	Reference Comment. No default value.
--rg	Reference Group, the working groups/consortia that produced the record. Default value 'XXX'.
--rl	Reference publishing location. No default value.
--rt	Reference Title. No default value.
--rx	Reference cross-reference. No default value.
--email	Email used to fetch information from NCBI taxonomy database. Default value '[email protected]'.
--environmental_sample	Identifies sequences derived by direct molecular isolation from a bulk environmental DNA sample with no reliable identification of the source organism. May be needed when organism belongs to Bacteria.
--expose_translations	Copy feature and attribute mapping files to the working directory. They will be used as mapping files instead of the default internal JSON files. You may modify them as it suits you.
--force_unknown_features	Force to keep feature types not accepted by EMBL. /!\ Option not suitable for submission purpose.
--force_uncomplete_features	Force to keep features without all the mandatory qualifiers. /!\ Option not suitable for submission purpose.
--interleave_genes	Print gene features with interleaved mRNA and CDS features.
--isolate	Individual isolate from which the sequence was obtained. May be needed when organism belongs to Bacteria.
--isolation_source	Describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived. Mandatory when environmental_sample option used.
--keep_duplicates	Do not remove duplicate features during the process. /!\ Option not suitable for submission purpose.
--keep_short_sequences	Do not remove short sequences (< 100bp) during the process. /!\ Option not suitable for submission purpose.
--locus_numbering_start	Start locus numbering with the provided value.
--no_progress	Hide conversion progress counter.
--no_wrap_qualifier	By default, lines are wrapped at 80 characters. The cut is made at the word level. Activating this option disables line wrapping for the qualifiers.
--translate	Include translation in CDS features.
--use_attribute_value_as_locus_tag	Use the value of the defined attribute as locus_tag.
--version	Sequence version number. The default value is 1.
--strain	Strain from which sequence was obtained. May be needed when organism belongs to Bacteria.

Optional arguments related to the software:

Parameter	Description
--ah, --advanced_help	Display advanced information of the parameter specified or of all parameters if none specified.
-h, --help	Show this help message and exit.
-v, --verbose	Increase verbosity.
-q, --quiet	Decrease verbosity.
--shame	Suppress the shameless plug.
-z, --gzip	Gzip output file.
--uncompressed_log	Some logs are compressed by default for better readability; with this option they are not.
-o , --output	Output filename.

Mapping

The challenge for a correct conversion is the correct mapping between the feature types described in the 3rd column, as well as the different attribute tags of the 9th column of the GFF3 file, and the corresponding EMBL features and qualifiers.
If you find that a feature type or an attribute tag is not mapped to the EMBL feature or qualifier you would like, you will have to modify the corresponding information in the mapping files.
The software will skip the unknown feature types (non-EMBL feature types that are not mapped to an EMBL feature type) and the unknown qualifiers (non-EMBL qualifiers that are not mapped to an EMBL qualifier) and will inform you during the conversion process. If you want to include them in the output, you can add the information needed in the corresponding mapping file.

To access the json mapping files launch the following command:

EMBLmyGFF3 --expose_translations

The command copies the json mapping files locally. You can then modify them as it suits you. When a json mapping file is present locally, it will be used instead of the default internal one.

Feature type

The EMBL format accepts 52 different feature types, whereas the GFF3 feature type (3rd column of the GFF3) is constrained to be a Sequence Ontology term or accession number; nevertheless, this constitutes 2278 terms in version 2.5.3 of the Sequence Ontology.

The file handling the proper mapping is translation_gff_feature_to_embl_feature.json

example:

"three_prime_UTR": {
    "target": "3'UTR"
}

This will map the three_prime_UTR feature type from the 3rd column of the GFF3 file to the 3'UTR EMBL feature type. When the feature type from the GFF3 is identical to the EMBL feature, there is no need to specify any target. If a target is needed and you didn't specify it, the tool will emit a warning message during the process.

You can decide which features will be printed in the output using the remove parameter:

"exon": {
    "remove": true
}

This way, no exon feature will be displayed in the output.
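
Editing the mapping file by hand works, but it can also be scripted. The sketch below is an illustration only: it assumes the local file is a flat JSON object keyed by GFF3 feature type, as the two excerpts above suggest, and the helper names set_feature_rule and update_mapping_file are hypothetical.

```python
import json
from pathlib import Path

# File name taken from this README; it must exist locally
# (see --expose_translations above).
MAPPING_FILE = "translation_gff_feature_to_embl_feature.json"


def set_feature_rule(mapping, gff_type, target=None, remove=None):
    """Add or update the rule for one GFF3 feature type in a mapping dict."""
    rule = mapping.setdefault(gff_type, {})
    if target is not None:
        rule["target"] = target
    if remove is not None:
        rule["remove"] = remove
    return mapping


def update_mapping_file(path, gff_type, target=None, remove=None):
    """Load a local JSON mapping file, update one rule, and write it back."""
    path = Path(path)
    mapping = json.loads(path.read_text())
    set_feature_rule(mapping, gff_type, target=target, remove=remove)
    path.write_text(json.dumps(mapping, indent=4) + "\n")
```

For example, update_mapping_file(MAPPING_FILE, "exon", remove=True) would reproduce the second excerpt above.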

GFF3 Attribute to EMBL qualifier

The EMBL format accepts 98 different qualifiers, whereas the attribute tag types in the 9th column of the GFF3 are unlimited. The file handling the proper mapping is translation_gff_attribute_to_embl_qualifier.json

example:

"Dbxref": {
    "source description": "A database cross reference.",
    "target": "db_xref",
    "dev comment": ""
},

This will map the Dbxref attribute tag from the 9th column of the GFF3 file to the db_xref EMBL qualifier.

Other

The source (2nd column) as well as the score (6th column) from the GFF3 file can also be handled through the translation_gff_other_to_embl_qualifier.json mapping file.

"source": {
    "source description": "The source is a free text qualifier intended to describe the algorithm or
                           operating procedure that generated this feature. Typically this is the name of
                           a piece of software, such as Genescan or a database name, such as Genbank. In
                           effect, the source is used to extend the feature ontology by adding a qualifier
                           to the type creating a new composite type that is a subclass of the type in the
                           type column.",
    "target": "note",
    "prefix": "source:",
    "dev comment": "EMBL qualifiers tend to be more specific than this, so very hard to create a good
                    mapping."
},

This will map the source from the 2nd column of the GFF3 file to the note EMBL qualifier.

/!\ Please notice that the prefix adds information before the source value within the qualifier (adding information after the value is also possible using suffix).
e.g. the source value is "Prokka":
Within the EMBL file, instead of getting note="Prokka", we will get note="source:Prokka"
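
The effect of these two keys can be pictured with a few lines of code. This is only an illustrative sketch, not the tool's implementation: format_qualifier is a hypothetical helper, and placing the suffix after the value is an assumption based on its name.

```python
def format_qualifier(target, value, prefix="", suffix=""):
    """Render an EMBL-style qualifier string with optional prefix and suffix."""
    return '{}="{}{}{}"'.format(target, prefix, value, suffix)


print(format_qualifier("note", "Prokka"))                    # note="Prokka"
print(format_qualifier("note", "Prokka", prefix="source:"))  # note="source:Prokka"
```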

Validate your embl flat file

The output can be validated using the ENA flat file validator distributed by EMBL. Please visit http://www.ebi.ac.uk/ena/software/flat-file-validator and/or https://github.com/enasequence/sequencetools for more information.

Known issues

biopython version
There's a bug between bcbio-gff 0.6.4 and biopython 1.68, so use biopython 1.67.

If you have several versions of biopython or bcbio-gff on your computer, it is possible that an incompatible version is used by default, which will lead to an execution error. To check the version actually used during the execution you can use this command:

python -c "import Bio; from BCBio import GFF; print('biopython version: ' + Bio.__version__); print('bcbio-gff version: ' + GFF.__version__)"

Duplicated Features
Features that have the same key (feature type) and location as another feature are considered duplicates and aren't allowed by the EMBL database, so they are removed during the process. If you don't plan to submit the file to ENA and you wish to keep these features, use the --keep_duplicates option.
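
To make the duplicate rule concrete, here is a minimal sketch of removing features that share the same key and location. It is an illustration under stated assumptions, not the tool's actual code: features are represented as plain dictionaries with hypothetical keys type, start, end and strand, and the first occurrence is the one kept.

```python
def remove_duplicates(features):
    """Keep the first feature seen for each (type, start, end, strand) key."""
    seen = set()
    kept = []
    for feature in features:
        key = (feature["type"], feature["start"], feature["end"], feature["strand"])
        if key in seen:
            continue  # same feature type and same location: duplicate
        seen.add(key)
        kept.append(feature)
    return kept
```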

Citation

Norling M, Jareborg N, Dainat J. EMBLmyGFF3: a converter facilitating genome annotation submission to European Nucleotide Archive. BMC Res Notes. 2018 Aug 13;11(1):584. doi: 10.1186/s13104-018-3686-x.

Author

Martin Norling^1,2, Niclas Jareborg^1,3, Jacques Dainat^1,2

¹National Bioinformatics Infrastructure Sweden (NBIS), SciLifeLab, Uppsala Biomedicinska Centrum (BMC), Husargatan 3, S-751 23 Uppsala, SWEDEN.
²IMBIM - Department of Medical Biochemistry and Microbiology, Box 582, S-751 23 Uppsala, SWEDEN.
³Department of Biochemistry and Biophysics, Stockholm University / SciLifeLab, Box 1031, S-171 21 Solna, SWEDEN.

emblmygff3's Issues

Warning regarding protein ID

Hello again. I'm getting the following warning:

WARNING feature: The value(s) ['AAEL012102-PB'] is(are) invalid for the qualifier protein_id of the feature CDS. We will not report the qualifier. (Here is the regex expected: [a-zA-Z]{3}[0-9]{5}\.[0-9]+)

I guess the hyphen in the name is causing an issue? All the protein IDs in my GFF3 file have a hyphen and end up triggering this error (after a point the program just gets tired of them and quits printing them). I would like to preserve this information in my EMBL file, can you suggest a fix?
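
The pattern printed in the warning can be tested directly against an identifier. A minimal sketch: the regular expression is copied from the warning above, but requiring a full match is an assumption about how the tool applies it, and ABC12345.1 is a made-up value used only to show the expected shape.

```python
import re

# Pattern copied from the warning message above.
PROTEIN_ID_PATTERN = re.compile(r"[a-zA-Z]{3}[0-9]{5}\.[0-9]+")


def is_valid_protein_id(value):
    """Return True if the whole value matches the protein_id pattern."""
    return PROTEIN_ID_PATTERN.fullmatch(value) is not None


print(is_valid_protein_id("AAEL012102-PB"))  # False
print(is_valid_protein_id("ABC12345.1"))     # True (made-up identifier)
```

The value from the warning fails on its fourth character already: the pattern expects three letters followed by five digits, a dot and a version number, so the hyphen is not the only mismatch.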

When no strand for CDS by default the negative strand is used

1) When a CDS has no strand, the negative strand is used by default. It would probably be more coherent to use the plus strand by default.
2) We must raise an error when a CDS does not have any strand information.
This is awkward, but tools like Ugene can create CDS without strand.

ZeroDivisionError (when --no_progress activated)

Hello

I am trying to use EMBLmyGFF3 and I get the following error:

Traceback (most recent call last):
  File "/scratch/OSR/bin/EMBLmyGFF3/scripts/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.6', 'console_scripts', 'EMBLmyGFF3')()
  File "/home/psur9757/.local/lib/python2.7/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 1386, in main
    EMBL.print_progress(True)
  File "/home/psur9757/.local/lib/python2.7/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 386, in print_progress
    progress = "[" + "="* int(78 * (float(EMBL.progress)/EMBL.total_features))
ZeroDivisionError: float division by zero

I checked the biopython and bcbio-gff versions:

$ python --version
Python 2.7.9
$ python -c "import Bio; from BCBio import GFF; print 'biopython version: '+Bio.__version__; print 'bcbio-gff version: '+GFF.__version__"
biopython version: 1.67
bcbio-gff version: 0.6.4

My command:

samtools faidx $genome $scaffold -o $FASTAs/${scaffold}.fa
grep "^$scaffold" $gff > $GFFs/${scaffold}.gff3
$project/EMBLmyGFF3 --shame --no_progress --ra $AUTHOR --rg $REFERENCE_GROUP -i $LOCUS_TAG -p $PROJECT -m "$MOLECULE" -r $TABLE -t linear -s "$SPECIES" -x $TAXONOMY -o $EMBLs/${scaffold}.embl $GFFs/${scaffold}.gff3 $FASTAs/${scaffold}.fa
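
The traceback points at a division by the total number of features, which fails when that total is zero. The sketch below shows one way such a computation can be guarded. It is an illustration only, not the fix applied in the tool: progress_bar is a hypothetical stand-alone function, and the width of 78 is taken from the traceback.

```python
def progress_bar(done, total, width=78):
    """Return a text progress bar; an empty bar is returned when total is 0."""
    if total <= 0:
        # Nothing to count yet: avoid dividing by zero.
        return "[" + " " * width + "]"
    filled = min(width, int(width * (float(done) / total)))
    return "[" + "=" * filled + " " * (width - filled) + "]"
```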

Problems of writing long sequences

For a 370Mb chromosome arm, it is going to take me 4-5 days to convert the GFF3 to EMBL. The feature part was done in a few seconds, and then the sequence 'SQ' part takes a very long time. Is this normal?

But it is very fast to write short sequences of hundreds of kb. I wonder why?

ID not taken as locus tag

Hi, excited to see something that may make things easier. bit of a nightmare otherwise. I have ID's within my Gff and was expecting them to be used for the locus tags but they are not and sequential numbers are instead. A note is created of the ID which I think would be better if just the locus_tag became the ID as I think that is it's purpose. I don't have gene names in gff but ideally a tab file of gene names could be given to add these to the resulting embl file too, as this is the likely starting point of having gene names available. I would also like to parse the exon number after the : and add this in, although I don't think this is essential. I'm still trying to work out the ENA format requirements for submission. I think I could just have a locus tag as the minimum feature and what I'm working towards. The webin validation tool complains about overlapping UTR and CDS features of two genes in the same direction. Could a correction part be added to cleave UTR and correct gene when detects this? As I have to work out how to fix this and start again. I know of a script somewhere that will do the cleaving of UTR at least. Sorry a few change requests or otherwise I'll try to make the changes myself when time but harder when don't know the code.

FT mRNA join(433449..433533,433946..434073,434612..434836,
FT 435438..435904)
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001"
FT CDS join(433449..433533,433946..434073,434612..434836,
FT 435438..435710)
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001:cds"
FT /transl_table=1
FT exon 433449..433533
FT /locus_tag="SPEXI_LOCUS1"
FT /note="source:maker"
FT /note="ID:SPEXI_01T000001:1"

Converts to embl but with question marks in SQ

I'm working with yeasts and I really need their embl files, but when I run the program (using bash and python), I encounter the following problem:
The following warnings pop up in the terminal (though I don't think they are the cause of the problem):

17:25:17 ERROR feature: >>trna<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
17:25:17 WARNING feature: Unknown qualifier 'NAME' - skipped
17:25:17 ERROR feature: >>trna_exon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
17:25:30 ERROR feature: >>UTR<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.
Otherwise tell me which EMBL feature it corresponds to by adding the information within the json mapping file.
Conversion done

And, as the final line says, the conversion is done. When I open the generated embl, the features are fine, but the sequence is all question marks.

FT /transl_table=12
XX
SQ Sequence 2596028 BP; 0 A; 0 C; 0 G; 0 T; 2596028 other;
?????????? ?????????? ?????????? ?????????? ?????????? ?????????? 60
?????????? ?????????? ?????????? ?????????? ?????????? ?????????? 120

And if I keep scrolling, it is as if the conversion had started again:

 ????????                                                            2596028

//
ID XXX; XXX; linear; genomic DNA; XXX; XXX; 2596667 BP.
XX
AC XXX;
XX
AC * SOME_YEAST
XX
PR Project:XXX;

After that, the only existing feature is "gap" and the sequence (SQ) is now as it is supposed to be:
FT gap 2556681..2556981
FT /estimated_length=301
XX
SQ Sequence 2596667 BP; 806943 A; 475017 C; 477105 G; 804281 T; 33321 other;
AATCTGCTCA GTAAGGCCCA TAAATCGGCT CTGCATTTCT TCTGTGGGCA TTTTGCCGTA 60
CTTTTTTAAT TATGTTGCAG ACGAAACTGA ATCAAGCTCG TCGACAGCTT CGTACAGCCT 120

I have no idea why this would happen. I really hope you can help me figure out what is happening,
I really need those embl files.

--translate option does not work anymore

line 607 of feature.py has to be replaced by
translated_seq = str(seq.translate(codon_table)).replace('B','X').replace('Z','X').replace('J','X')

option to flag or remove gene models with short exon (<10 nt)

mRNAs with short introns (<10 bp) are not accepted for submission. It would be nice to catch those cases. They would be easy to find by looking at the list of coordinates of the mRNA features.
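
A minimal sketch of that idea, assuming the mRNA coordinates are available as a list of (start, end) pairs, 1-based and inclusive, as in an EMBL join(...) location. The function short_introns is hypothetical and not part of the tool.

```python
def short_introns(exons, min_length=10):
    """Return (start, end) of introns shorter than min_length.

    exons: list of (start, end) pairs, 1-based inclusive, as in an EMBL join().
    """
    ordered = sorted(exons)
    found = []
    # An intron is the gap between the end of one exon and the start of the next.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        intron_start, intron_end = prev_end + 1, next_start - 1
        length = intron_end - intron_start + 1
        if 0 < length < min_length:
            found.append((intron_start, intron_end))
    return found
```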

Improve performance (EMBLmyGFF3 v3)

We have a user who used a huge GFF annotation of 3.34 GB. It took ~24h of computation and apparently used more than 50 GB of memory...
We should investigate how to optimise the speed and the memory usage.

Thread mentioning this here.

How to avoid duplicated db_xref entries in CDS?

Hi,
I encountered multiple, identical db_xref entries for CDS features with more than one exon. How can I avoid this? Do I really have to limit the db_xref information to one of multiple CDS entries of one feature in the gff file?

Thanks in advance
Best regards
Nadine

thank you

I don't have an issue, I am just really grateful you wrote this!

I was so frustrated trying to generate a gap file and it was so much easier to convert it directly with the annotation to EMBL format using your script! So, thank you!

Another Question: Unused Stop Codon in mitochondrial DNA provokes Error

Hi,

I've stumbled upon another thing. In the mitochondrion of our organism not all stop codons are used as such. More precisely, TGA is used to code for tryptophan and not as a stop codon. So each time a gene has a TGA inside, I get the error "Stop codon found within the CDS...". Would it be possible to exclude certain stop (or even start) codons?

Thanks in advance

Best Regards,
Nadine

Accession number same as contig name.

Hi,
I am trying to use this tool to make an embl file for upload to IMG, however I am having issues with the output. First, is there a way to assign the contig name as ID? IMG doesn't accept files with XXX as the ID name.
Thanks

When an error occurs, give information about location of error in GFF3 file

Thank you for this great tool! Very useful! and flexible enough to adapt to home attribute tags.
Please find below a suggestion for improvement.

Using this tool, I faced some errors with my big input GFF3 file (>300k features), e.g.:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 5: ordinal not in range(128)
ERROR feature: Stop codon found within the CDS. It will rise an error submiting the data to ENA. Please fix your gff file.
Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.

And it was not easy to find out what character or what CDS was the wrong one in the GFF3 file. I think it would be great to provide information about the location of the error (e.g. line number, feature ID).

Error with embl file during ENA webin-cli validation

Hi there,

I was trying to validate the embl files from EMBLmyGFF3 on ENA's webin-cli but I got the below error message.

ERROR: The qualifier "isolation_source" must exist when qualifier "environmental_sample" exists within the same feature.

I used taxid: 77133 (uncultured bacterium) at the time of running EMBLmyGFF3. To give you more of a background, I'm trying to submit a genome annotation file of an uncultivated bacterium. Any help here would be very much appreciated. Thanks in advance for your excellent support (as always!).

Joining of intron features

Currently EMBLmyGFF3 joins intron entries. Here is an example of joined introns:

FT   intron          join(6625..6675,6797..6841,6924..6966,7119..7161,
FT                   7245..7286,7423..7476,7630..7673,7750..7962,8110..8158,
FT                   8225..8265,8365..8407)
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"

As this does not make sense biologically, this issue should be fixed in later versions.

Improve -i / --locus_tag option

The following points should be checked on the locus tag prefix given with option -i / --locus_tag (required when the locus-tag is registered at ENA):

A locus tag prefix must have the following format:

starts with a letter
is at least 3 characters long
is upper case
contains only alpha-numeric characters and no symbols such as -_*
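
The four rules above can be expressed as a single regular expression. A minimal sketch, checking only the rules listed here; is_valid_locus_tag_prefix is a hypothetical helper, not an existing function of the tool.

```python
import re

# One upper-case letter, then at least two more upper-case letters or digits.
LOCUS_TAG_PREFIX = re.compile(r"[A-Z][A-Z0-9]{2,}")


def is_valid_locus_tag_prefix(prefix):
    """Check the locus tag prefix against the four rules listed above."""
    return LOCUS_TAG_PREFIX.fullmatch(prefix) is not None
```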

Add warning for unknown sequence

When a sequence from the gff is not found within the provided fasta file, a string of ???? is created as the sequence. Its length is determined by the end position of the last feature of the missing sequence.

It would be nice to inform the user that a potential de-synchronization of the sequence names occurred between the gff and the fasta file.

On top of that, using the --translate option will raise an error because the ??? codon does not exist. Displaying a warning before that error could help users understand the problem.
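
Until such a warning exists, the mismatch can be detected before running the conversion. The following is a minimal sketch using plain text parsing; the function names are hypothetical, and it assumes the sequence identifier is the first word of each FASTA header and the first column of each GFF3 line.

```python
def fasta_ids(lines):
    """Collect sequence identifiers (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(lines):
    """Collect seqids (column 1) from GFF3 lines, skipping comments."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows, no more features
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_sequences(gff_lines, fasta_lines):
    """Return the seqids used in the GFF3 that have no FASTA record."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))
```

With files on disk, the three functions can be fed open file handles, for example missing_sequences(open("maker.gff3"), open("maker.fa")).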

Update for Python 3 compliance

It would be nice to make the tool Python 3 compliant.

Require help in mapping GFF3 type (Column3) to EMBL qualifier

Hello! I have many ncRNA features (lncRNA,snoRNA,snRNA, etc.) in my GFF3 file. According to your instructions I included the following in the translation_gff_feature_to_embl_feature.json file:

"ncRNA_gene": {
"target": "ncRNA",
},
"snoRNA": {
"target": "ncRNA",
},
"lnc_RNA": {
"target": "ncRNA",
},

I am getting a new warning saying,

WARNING feature: The qualifier >ncRNA_class< is mandatory for the feature >ncRNA<. We will not report the feature.

I'm not quite sure how (and where) to add this qualifier ncRNA_class, seeing that it changes according to feature, ie., if the feature is a lnc_RNA, then it will map to ncRNA in the EMBL file but the ncRNA_class will be lncRNA, snoRNA -> ncRNA_class:snoRNA etc.

Issue with option --organelle

Traceback (most recent call last):
  File "/opt/6.x/python-2.7.2/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/opt/6.x/python-2.7.2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/__main__.py", line 4, in <module>
    main()
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/EMBLmyGFF3.py", line 1277, in main
    writer.set_organelle( args.organelle )
  File "/projet/fr2424/sib/lgueguen/git/gmod/EMBLmyGFF3/EMBLmyGFF3/EMBLmyGFF3.py", line 964, in set_organelle
    organelle = self._verify( self.organelle, "organelle")
AttributeError: 'EMBL' object has no attribute 'organelle'

How to add in comment or CC line

How should you input text that you want to be incorporated into the CC comments or notes line?

bcbio not found

Hi,

I've followed the instructions to install EMBLmyGFF3 with git on Mac. I got the following error:

Download error on https://pypi.org/simple/bcbio-gff/: [Errno 54] Connection reset by peer -- Some packages may not be found! Couldn't find index page for 'bcbio-gff' (maybe misspelled?) Scanning index of all packages (this may take a while) Reading https://pypi.org/simple/ Download error on https://pypi.org/simple/: [Errno 54] Connection reset by peer -- Some packages may not be found! No local packages or working download links found for bcbio-gff==0.6.4 error: Could not find suitable distribution for Requirement.parse('bcbio-gff==0.6.4')

Does this mean I have to install bcbio myself before installing EMBLmyGFF3?

Cheers.

Feature qualifier values wrapped across multiple lines

Hello,

Thanks for making a very useful tool, I'm glad to have come across it.

There is an issue I'm having with line-wrapping. EMBLmyGFF3 appears to be wrapping lines at 80 characters. This creates problems with longer qualifier values, e.g. product names, because they are broken across several lines, often in the middle of a word. When one later runs the ENA flat file validator tool with the "fix" option, it unwraps the line but adds a space, so now the unwrapped name is broken.

Would it be possible to add an option to turn off line wrapping? Thank you!

-- Brandon

EMBL to fasta ?

Hi,

Thanks for the script it works well so far. During the process some of my scaffolds have been discarded (because too short for EBI).
Do you have a reverse script to convert the EMBL to FASTA in order for me to recompute some statistics (N50 etc.) or do you know a script that can perform directly some statistics from EMBL format ?

Best,
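As a rough illustration of what such a reverse step could look like, the sketch below reads the sequence block of each EMBL entry with the standard library, prints FASTA and computes N50. The helper names `read_embl` and `n50` are hypothetical, and the sample entry is made up. If the ID line of a submission-ready file holds a placeholder rather than a name, the identifier would have to be taken from elsewhere.

```python
import io

def read_embl(handle):
    """Yield (identifier, sequence) pairs from an EMBL flat file handle."""
    name, chunks, in_seq = None, [], False
    for line in handle:
        if line.startswith("ID "):
            # First token after "ID", terminated by ';'.
            name = line[2:].split(";")[0].strip()
        elif line.startswith("SQ "):
            in_seq = True
        elif line.startswith("//"):
            yield name, "".join(chunks)
            name, chunks, in_seq = None, [], False
        elif in_seq:
            # Sequence lines hold blocks of bases followed by a running count.
            chunks.append("".join(c for c in line if c.isalpha()))

def n50(lengths):
    """Length of the shortest sequence in the set covering half of the total."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

sample = """ID   scaffold1; SV 1; linear; genomic DNA; STD; PLN; 20 BP.
XX
SQ   Sequence 20 BP; 12 A; 1 C; 3 G; 4 T; 0 other;
     taaaataaaa aaaatcgggt                                                    20
//
"""
records = list(read_embl(io.StringIO(sample)))
for name, seq in records:
    print(">{}\n{}".format(name, seq))
print("N50:", n50([len(seq) for _, seq in records]))
```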

Bio.Alphabet has been removed from Biopython

Hi there,

I installed EMBLmyGFF3 via conda. When I try to run it, I get an error:

Traceback (most recent call last):
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/bin/EMBLmyGFF3", line 6, in <module>
    from EMBLmyGFF3 import main
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/__init__.py", line 3, in <module>
    from .EMBLmyGFF3 import *
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/EMBLmyGFF3.py", line 4, in <module>
    from .modules.feature import Feature
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/EMBLmyGFF3/modules/feature.py", line 10, in <module>
    from Bio.Alphabet.IUPAC import *
  File "/home/tagirdzh/miniconda/envs/EMBLmyGFF3/lib/python3.8/site-packages/Bio/Alphabet/__init__.py", line 20, in <module>
    raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Thanks!

Implement a progress bar

It would be nice to implement a progress bar. Using tqdm ?

question: what additional options for mitochondria

I have a translation table of 4 for mitochondria, but when I create the EMBL file it reports a conflict with the species translation table 1. How do I set the organelle to get around this?

organelle = self._verify( self.organelle, "organelle")
AttributeError: 'EMBL' object has no attribute 'organelle'

Why can't we use biopython > 1.67?

emblmygff3 1.2.3 has requirement biopython==1.67, but you'll have biopython 1.72 which is incompatible.

Question: How to add more than 1 publication

Hi,

I need to add more than one publication. How would this be possible? I already tried repeating the --ra, --rt and --rl parameters for each publication, but only the last one is used. Of course I could do it manually, but then the RP field would no longer be filled in automatically and I would have to do it several times, which can be very time-consuming.

Thank you very much in advance.

Best Regards,
Nadine
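"Only the last one is used" is what one would expect if the options are parsed with argparse's default store action, although that is an assumption about the tool's internals. The sketch below shows how repeated flags could be collected with `action="append"`. The author, title and location values are obvious placeholders, and this is not the tool's current behaviour.

```python
import argparse

parser = argparse.ArgumentParser()
for flag in ("--ra", "--rt", "--rl"):
    # "append" collects every occurrence instead of keeping only the last one.
    parser.add_argument(flag, action="append", default=[])

args = parser.parse_args([
    "--ra", "Author One", "--rt", "Title one", "--rl", "Location one",
    "--ra", "Author Two", "--rt", "Title two", "--rl", "Location two",
])

# Pair the n-th author, title and location into one publication each.
publications = list(zip(args.ra, args.rt, args.rl))
for number, (authors, title, location) in enumerate(publications, start=1):
    print(number, authors, title, location)
```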

Find a way to make the json file more easily accessible

Since we wrapped the tool up as a Python module to ease installation and make sure that people use the correct versions of the dependencies, the JSON mapping files have become less easily accessible.
We could require them to be present where the tool is launched. If they are missing, we copy the default JSON files there and use them.
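A minimal sketch of that idea, assuming the defaults live in some package directory: copy each default mapping into the launch directory only when no local copy exists. The function name `ensure_local_mappings` is hypothetical, and temporary directories stand in for the real locations.

```python
import shutil
import tempfile
from pathlib import Path

def ensure_local_mappings(package_dir, work_dir):
    """Copy each default *.json mapping into work_dir unless one already exists there."""
    copied = []
    for default in sorted(Path(package_dir).glob("*.json")):
        local = Path(work_dir) / default.name
        if not local.exists():        # never overwrite a file the user has edited
            shutil.copy(default, local)
            copied.append(local.name)
    return copied

# Demonstration: the first call copies the file, the second finds it in place.
with tempfile.TemporaryDirectory() as pkg, tempfile.TemporaryDirectory() as cwd:
    (Path(pkg) / "translation_gff_feature_to_embl_feature.json").write_text("{}")
    first = ensure_local_mappings(pkg, cwd)
    second = ensure_local_mappings(pkg, cwd)
    print(first, second)
```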

Download

I've tried downloading the program with all 3 options, but I get errors every time I try to download or run it, and I'm not sure what is wrong. Is it because I have the updated version of Python?

Traceback (most recent call last):
  File "/Users/chengh1/miniconda3/bin/EMBLmyGFF3", line 33, in <module>
    sys.exit(load_entry_point('EMBLmyGFF3==2', 'console_scripts', 'EMBLmyGFF3')())
  File "/Users/chengh1/miniconda3/bin/EMBLmyGFF3", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/Users/chengh1/miniconda3/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/Users/chengh1/miniconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/__init__.py", line 3, in <module>
    from .EMBLmyGFF3 import *
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 4, in <module>
    from .modules.feature import Feature
  File "/Users/chengh1/miniconda3/lib/python3.8/site-packages/EMBLmyGFF3-2-py3.8.egg/EMBLmyGFF3/modules/feature.py", line 8, in <module>
    from Bio.Seq import Seq
ModuleNotFoundError: No module named 'Bio'

EMBL flatfile not compatible with PAGIT RATT

Thanks for the conversion software. I have a query that might be a bit off track.

I am trying to use the output .embl (1.31 GB) from augustus gff3 with RATT software (run with linux) and get this error:

I am using the reference.fa. Please make sure that the description line of each fasta entry is the same than in the embl file name!

I just wonder whether this is a known issue?

many thanks!

Bio.Alphabet issue

Hi,

Thanks for this software, I'm looking forward to using it. I've just installed it via conda, but I am having some issues that relate to an update of biopython. Can you tell me the versions of biopython and bcbio-gff that you were able to run the software on?

The install I ran today is running into issues with Bio.Alphabet. See the error:

    "Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

Here is my detailed conda environment information.

channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - bcbio-gff=0.6.6=pyh864c0ab_1
  - biopython=1.78=py37h8f50634_0
  - bx-python=0.8.9=py37h73d7ac5_2
  - ca-certificates=2020.6.20=hecda079_0
  - certifi=2020.6.20=py37hc8dfbb8_0
  - emblmygff3=2=py_0
  - ld_impl_linux-64=2.35=h769bd43_9
  - libblas=3.8.0=17_openblas
  - libcblas=3.8.0=17_openblas
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.3.0=h5dbcf3e_17
  - libgfortran-ng=7.5.0=hae1eefd_17
  - libgfortran4=7.5.0=hae1eefd_17
  - libgomp=9.3.0=h5dbcf3e_17
  - liblapack=3.8.0=17_openblas
  - libopenblas=0.3.10=pthreads_hb3c22a3_4
  - libstdcxx-ng=9.3.0=h2ae2ef3_17
  - lzo=2.10=h516909a_1000
  - ncurses=6.2=he1b5a44_1
  - numpy=1.16.4=py37h95a1406_0
  - openssl=1.1.1h=h516909a_0
  - pip=20.2.3=py_0
  - python=3.7.8=h6f2ec95_1_cpython
  - python-lzo=1.12=py37h81344f2_1001
  - python_abi=3.7=1_cp37m
  - readline=8.0=he28a2e2_2
  - setuptools=49.6.0=py37hc8dfbb8_1
  - six=1.15.0=pyh9f0ad1d_0
  - sqlite=3.33.0=h4cf870e_1
  - tk=8.6.10=hed695b0_1
  - wheel=0.35.1=pyh9f0ad1d_0
  - xz=5.2.5=h516909a_1
  - zlib=1.2.11=h516909a_1009

Programme failed: ... KeyboardInterrupt Terminated

Hi,

I've run EMBLmyGFF3 on a cluster in the following way:

EMBLmyGFF3 c_elegans.PRJNA13758.WS263.annotations.gff3 c_elegans.PRJNA13758.WS263.genomic.fa --topology linear --molecule_type 'genomic DNA' --transl_table 1 --species 'Caenorhabditis elegans' --locus_tag CELE --project_id PRJNA13758 -o c_elegans.PRJNA13758.WS263.annotations.embl

I have checked that I have Python 2.6, Biopython 1.67 and bcbio-gff 0.6.4. I can successfully pull out the help from EMBLmyGFF3.

I got the following output from my command above:

Traceback (most recent call last):
  File "/nfs/users/nfs_c/user/anaconda3/envs/python2env/bin/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.3', 'console_scripts', 'EMBLmyGFF3')()
  File "/nfs/users/nfs_c/user/anaconda3/envs/python2env/lib/python2.7/site-packages/EMBLmyGFF3-1.2.3-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1275, in main
    for record in GFF.parse(infile, base_dict=seq_dict):
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 742, in parse
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 322, in parse_in_parts
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 343, in parse_simple
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 637, in _gff_process
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 667, in _lines_to_out_info
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 189, in _gff_line_map
  File "build/bdist.linux-x86_64/egg/BCBio/GFF/GFFParser.py", line 89, in _split_keyvals
KeyboardInterrupt
Terminated

I submitted the command (in a .sh file) as a batch job with 20 MB memory on a GNU/Linux machine.

UnboundLocalError: local variable 'new_value' referenced before assignment

Hi,

in testing EMBLmyGFF3 to prepare an embl file with genome annotation for submission to EMBL, I'm getting the following error:

EMBLmyGFF3 scaffold1.gff scaffold1.fa --topology linear --transl_table 1 --molecule_type 'genomic DNA' --species 'Salix viminalis' --locus_tag TEST --project_id PRJEB00001 --de 'Single-molecule assembly' -o scaffold1.embl

    #############################################################################
    # NBIS 2018 - Sweden                                                        #
    # Authors: Martin Norling, Niclas Jareborg, Jacques Dainat                  #
    # Please visit https://github.com/NBISweden/EMBLmyGFF3 for more information #
    #############################################################################

12:25:01 WARNING feature: Unknown qualifier 'makerName' - skipped              ]
12:25:01 WARNING feature: Unknown qualifier '_QI' - skipped
12:25:01 WARNING feature: Unknown qualifier '_AED' - skipped
12:25:01 WARNING feature: Unknown qualifier '_eAED' - skipped
12:25:01 ERROR qualifier: local variable 'new_value' referenced before assignment
Traceback (most recent call last):
  File "/opt/pyenv/versions/2.7.10/envs/EMBLmyGFF3_venv/lib/python2.7/site-packages/EMBLmyGFF3/modules/qualifier.py", line 88, in _by_value_format
    formatted_value=new_value
UnboundLocalError: local variable 'new_value' referenced before assignment

I'm just using one scaffold for testing purposes. The head of the GFF file looks like this:

##gff-version 3
scaffold1	repeatmasker	match	2	581	896	+	.	ID=scaffold1:hit:709:1.3.0.0;Name=species:rnd-4_family-62|genus:Unspecified;Target=species:rnd-4_family-62|genus:Unspecified 253 760 +
scaffold1	maker	gene	444	3240	.	+	.	ID=salix_viminalisG00000000001;Name=at3g47200_37;makerName=genemark-scaffold1-processed-gene-0.6
scaffold1	maker	mRNA	444	3240	.	+	.	ID=salix_viminalisT00000000001;Parent=salix_viminalisG00000000001;Dbxref=PFAM:PF03140,InterPro:IPR004158;Name=at3g47200_37;_AED=0.25;_QI=0|0|0|0.25|1|1|4|0|412;_eAED=0.25;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1;product=UPF0481 protein At3g47200
scaffold1	maker	exon	444	618	.	+	.	ID=salix_viminalisE00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:1
scaffold1	maker	CDS	444	618	.	+	0	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds
scaffold1	maker	exon	827	912	.	+	.	ID=salix_viminalisE00000000002;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:2
scaffold1	maker	CDS	827	912	.	+	2	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds
scaffold1	maker	exon	2167	3072	.	+	.	ID=salix_viminalisE00000000003;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:3
scaffold1	maker	CDS	2167	3072	.	+	0	ID=salix_viminalisC00000000001;Parent=salix_viminalisT00000000001;makerName=genemark-scaffold1-processed-gene-0.6-mRNA-1:cds

Note that I have a local translation_gff_feature_to_embl_feature.json mapping "match" to "repeat_region":

 "match": {
 	"target": "repeat_region"
 },

The head of the fasta file is as:

>scaffold1
TAAAATAAAAAAAATCGGGTCGGGCCCAACTAAATGGGCCGGCTAGCCAAGATGGGCCAAAGCCCAATTTTAATGGGCTGGGCCGAGAGCCGTCCAGCCC
AAGACACGCAGAAGAGGAAAAAAAGAAAAGGGCAAAACGGCACTGTTTAGCACATGTTAATTAAACAGTTTACGTATTTCGTGAACAGTAAAATGGTGGT
CGGCCGACCACGACGAGAGGGTCACCTGCTATTGCCGCCGGGTAGAGGAGGTCGAGGTGGTTGTCCCTGTGGTTGTGGAGTCGAAAATGGTGGCCCGTGG
CGGCCGGAGGAGGCGTTGGAAGTGGCCGGTCTGTTGCTTGCTGTCGTCGGGGCTGTCACTGTTTCTTCGCCGGAGAGGACGACTGAGCTGCTGGAGTTGA
GGGGAGGCTGAAGGTGGTGATGAGGGTGGATATGGGTGGTTGAATGGTGGCTGTTGGAGGAGAGAGAGAGACGCCGGGTCCTCTGGTTTTAGAGAGAGAA
TGCTGTCGGGGAGAGAGAAAGGAGCTGCAACAGGCTGAGAGACGAAGGAGAGAGAGAGAGAAAGGGCTGCTGTGTCGCCGGAGCTGGAGAGGAAGAAAGG
GTGGCTGCCTCTGCGTGTGTATGCTTGTGTTCTGCAAATTTACCACGTCTTCGTCTTCCTCCTCCAGCCTTAATTTGAAACTGAAACTAAAATATTCGCC
TCTGTTCTCTCAAAACTTCTCAGTTTCTTCCTTGCTTTTCTTTGCCCAAATTTCTGTCGATTTTCCTCCCGTTTTTTCTCCCTTCTTTCTCCCCCTTCTG
CATGCATTTCATGCATGTATTTATAGGTTTGAAAGGCAACCCTTCAGCTGCCCATGGCGTGCAGCGAAGGGTTGCCGCCTGTGATTGCAGGTGGCGTGCC

Also, I installed EMBLmyGFF3 in a virtual environment, but when I run the maker example with the command EMBLmyGFF3-maker-example, I do get the correct output EMBLmyGFF3-maker-example.embl and without errors.

I can't seem to spot possible formatting errors in the input files but I'm just using this tool for the first time so there may be something that I'm missing.

Any help would be kindly appreciated.
Many thanks,
Pedro
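For readers unfamiliar with this error class, the sketch below shows the general pattern that produces it: a variable assigned only inside some branches and then used unconditionally. The functions and the format names are hypothetical and are not taken from qualifier.py.

```python
def format_value_buggy(value, value_format):
    if value_format == "quoted":
        new_value = '"{}"'.format(value)
    elif value_format == "integer":
        new_value = int(value)
    # No else branch: an unexpected format leaves new_value unassigned.
    return new_value

def format_value_safe(value, value_format):
    new_value = value                 # fall back to the unmodified value
    if value_format == "quoted":
        new_value = '"{}"'.format(value)
    elif value_format == "integer":
        new_value = int(value)
    return new_value

try:
    format_value_buggy("species:rnd-4_family-62", "other")
except UnboundLocalError as error:
    print("buggy version failed:", error)

print(format_value_safe("species:rnd-4_family-62", "other"))
```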

Locus tag clarification

Hi,

Thanks for developing this tool. It would be very useful for me.

Regarding the requirement to write the locus tag in the command (--locus_tag MY_LOCUS_TAG), do you mean the locus tag prefix, as described here? So, for Caenorhabditis elegans, this would be CELE because all gene features start with CELE?

Spread features like CDS are not collectively linked when L1 and/or L2 features are missing

If the L1 (e.g. gene) and L2 (e.g. mRNA) features are missing for several CDS rows that must be collectively linked (one CDS with several positions in the EMBL file), the tool creates one EMBL CDS feature per GFF CDS feature.
It would be nice to handle that.
Currently, to deal with such cases, we need to run agat_sp_gxf_to_gff.pl from AGAT to create the missing L1 and L2 features.
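A sketch of one possible way to link such rows without parent features, assuming CDS rows that belong together share an ID (or, failing that, a Parent) attribute: group them by that key and emit a single join location. The helper names are hypothetical, and the coordinates reuse a MAKER example quoted elsewhere on this page with the parent attributes dropped.

```python
from collections import defaultdict

def embl_location(segments, strand):
    """Build an EMBL location from (start, end) pairs belonging to one CDS."""
    parts = ["{}..{}".format(start, end) for start, end in sorted(segments)]
    location = parts[0] if len(parts) == 1 else "join({})".format(",".join(parts))
    return "complement({})".format(location) if strand == "-" else location

def group_cds(gff_lines):
    """Map (seqid, shared ID or Parent) to one location covering all its CDS rows."""
    segments, strands = defaultdict(list), {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        key = (cols[0], attrs.get("ID") or attrs.get("Parent"))
        segments[key].append((int(cols[3]), int(cols[4])))
        strands[key] = cols[6]
    return {key: embl_location(segs, strands[key]) for key, segs in segments.items()}

rows = [
    ("scaffold1", "maker", "CDS", "444", "618", ".", "+", "0", "ID=salix_viminalisC00000000001"),
    ("scaffold1", "maker", "CDS", "827", "912", ".", "+", "2", "ID=salix_viminalisC00000000001"),
    ("scaffold1", "maker", "CDS", "2167", "3072", ".", "+", "0", "ID=salix_viminalisC00000000001"),
]
locations = group_cds("\t".join(row) for row in rows)
print(locations)
```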

Error in translations

Issue reported by a user:

When I used the option --translate, some CDSs were translated incorrectly in the EMBL file. For instance:

In the gff3
Chr_1 AUGUSTUS        gene    55249   56486   0.84    -       .       ID=g13
Chr_1 AUGUSTUS        transcript      55249   56486   0.84    -       .       ID=g13.t1;Parent=g13
Chr_1 AUGUSTUS        stop_codon      55249   55251   .       -       0       Parent=g13.t1
Chr_1 AUGUSTUS        intron  55679   55753   0.95    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  55904   55957   0.94    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56015   56069   1       -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56228   56296   1       -       .       Parent=g13.t1
Chr_1 AUGUSTUS        intron  56394   56472   0.99    -       .       Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55249   55678   0.91    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55754   55903   0.94    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     55958   56014   0.94    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56070   56227   1       -       0       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56297   56393   0.99    -       1       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        CDS     56473   56486   1       -       0       ID=g13.t1.cds;Parent=g13.t1
Chr_1 AUGUSTUS        start_codon     56484   56486   .       -       0       Parent=g13.t1
# protein sequence = [MSLIRDSGPRRLVDGFWEYGRYYGSWRPRKYLFPIDAEELNRMDIFHKFFLVARDEALFASPLDPNRDQPLRILDLGT
#GTGIWAINVAEVTAVPPEIMVVDLHQIQPALIPLGISPLQFDIEEASWEPLMKDCDLVHIRMLYGSIQTDLWPDIYHKTFEHLKPGSGYIEHIEIDWV
#PRWDGNDVPPESSLHEWSQLLLRGLDRFNRNARIDVGEVRITLDKAGFVDFREETIRCYVNPWSSERREREIARWFNLGLSQCLEAMSLMPMIEGLSM
# TKEQVKELCDRAKKEICILRYHAYMTL]


In the converted embl
CDS             complement(join(55249..55678,55754..55903,55958..56014,
FT                   56070..56227,56297..56393,56473..56486))
FT                   /locus_tag="LOCUSTAG_LOCUS13"
FT                   /codon_start=1
FT                   /note="source:AUGUSTUS"
FT                   /note="ID:g13.t1.cds"
FT                   /translation="CP*LEIQGQGVL*MDFGSMAGIMAHGDRGSICSRLTRRNLTGWTS
FT                   FTSSS*LLETKLYLPPHWTRTGTNPFEYLILELVPEYGPLMLQK*LLFHRRSWLWISIR
FT                   FSQPSFPSVFLPYNLTSKKHHGSL**KIATWCTYECSMAVSRPICGQIYTIKLSNI*SL
FT                  GLDT*NTLKSIGCPGGMETTSRPSHRCMNGPSYYCEAWIVSTGMPELMWGKFE*PSTRP
FT                   GSSISEKRPFGAT*THGPRSVVSGKLRDGSTSGFLNVSRR*V*CP**RG*V*PKNKSRS
FT                   SVTGPKRRFAYCAITLI*RC"
FT                   /transl_table=1

Any suggestions?

Thanks
Edison
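One thing worth checking in cases like this is which CDS segment supplies the phase. In GFF3 the phase is counted from the beginning of the feature relative to its strand, so on the minus strand the first segment is the one with the highest coordinates. The sketch below derives /codon_start on that basis. The function name is hypothetical, and this is not a diagnosis of the actual cause of the report above.

```python
def codon_start(cds_segments, strand):
    """cds_segments: (start, end, phase) tuples; returns the /codon_start value (1-3)."""
    if strand == "-":
        # On the minus strand the 5' segment has the highest end coordinate.
        first = max(cds_segments, key=lambda seg: seg[1])
    else:
        first = min(cds_segments, key=lambda seg: seg[0])
    return first[2] + 1

# CDS rows of g13.t1 from the GFF3 excerpt above: (start, end, phase).
g13 = [
    (55249, 55678, 1),
    (55754, 55903, 1),
    (55958, 56014, 1),
    (56070, 56227, 0),
    (56297, 56393, 1),
    (56473, 56486, 0),
]
print(codon_start(g13, "-"))
```

Reading the phase from the lowest-coordinate segment instead would give a different value for this gene, which is one way a frame shift can creep in.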

Confusion with the locus tag

Hi,

There are some rules for EBI about the locus tag:

https://ena-docs.readthedocs.io/en/latest/faq/locus_tags.html

I used your script to create an EMBL flat file, but for each locus, _LOCUSXX (with XX a number) is added after my prefix, for example: PRE_LOCUSXX

In their example, a locus tag would be /locus_tag="BN5_00001". So will PRE_LOCUS be considered the prefix by EBI and then be refused because of this rule?

All characters must be alphanumeric with none such as -_*
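To make the question concrete, the sketch below splits a locus tag at the first underscore and checks only the prefix. The rule set encoded here (starts with a letter, 3 to 12 characters, upper-case letters and digits only) paraphrases the linked ENA page from memory and should be checked against it. Whether ENA accepts the part after the underscore is not tested here.

```python
import re

# Assumed prefix rules: starts with a letter, 3 to 12 characters,
# upper-case letters and digits only.
PREFIX_RE = re.compile(r"[A-Z][A-Z0-9]{2,11}")

def split_locus_tag(locus_tag):
    """Split 'PREFIX_identifier' at the first underscore."""
    prefix, _, identifier = locus_tag.partition("_")
    return prefix, identifier

def is_valid_prefix(prefix):
    return PREFIX_RE.fullmatch(prefix) is not None

for tag in ("BN5_00001", "PRE_LOCUS12"):
    prefix, identifier = split_locus_tag(tag)
    print(tag, prefix, identifier, is_valid_prefix(prefix))
```

Under this reading, the prefix of PRE_LOCUS12 is PRE, and the underscore acts only as the separator.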

Webin-CLI validation failing due to duplicated feature locations in EMBLmyGFF3 flat file

I am getting the following errors when I try to validate my EMBLmyGFF3-generated flat file through Webin-CLI.

head genome/Pmacd_v0.10/validate/Pmacd_v0.10_ENAsubmit.embl.gz.report
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6951 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 6921 of Pmacd_v0.10_ENAsubmit.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 7321 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 7283 of Pmacd_v0.10_ENAsubmit.embl.gz]
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 13785 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 13757 of Pmacd_v0.10_ENAsubmit.embl.gz]

My gff has the format:

Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        gene    14395   28338   .       -       .       ID=PmacdG00000006135
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        transcript      14395   28338   .       -       .       ID=PmacdG00000006135.1;Parent=PmacdG00000006135
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    14395   14538   .       -       .       ID=PmacdG00000006135.1-exon1;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    25250   25354   .       -       .       ID=PmacdG00000006135.1-exon2;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        exon    28297   28338   .       -       .       ID=PmacdG00000006135.1-exon3;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        CDS     14395   14538   .       -       0       ID=PmacdG00000006135.1-cds1;Parent=PmacdG00000006135.1
Pmacd_v0.10_Sc0000027_pilon     AUGUSTUS        CDS     25250   25354   .       -       0       ID=PmacdG00000006135.1-cds1;Parent=PmacdG00000006135.1

I ran EMBLmyGFF3 using the command:

/home/racste/.local/bin/EMBLmyGFF3  Pmacd_v0.10_braker_gffread_merge_mod_nogeneID.gff Pmacd_v0.10.fasta \
        --topology linear \
        --molecule_type 'genomic DNA' \
        --transl_table 1  \
        --species 'Pieris macdunnoughii' \
        --project_id PRJEB42400 \
        -o result.embl \
        -locus_tag PMACD

Of course my first thought was to remove the duplicates with agat_sp_fix_features_locations_duplicated.pl, but there were no duplicates detected:

=> OmniscientI total time: 198 seconds
Pmacd_v0.10_braker_gffread_merge_mod_nogeneID.gff file parsed

We found 0 cases where isoforms have identical exon structures (we removed duplicates by keeping the one with longest CDS).
We found 0 cases where l2 from different gene identifier have identical exon but no CDS at all (we removed one duplicate).
We found 0 cases where l2 from different gene identifier have identical exon and CDS structures (we removed duplicates by keeping the one with longest CDS).
We found 0 cases where l2 from different gene identifier have identical exon structures (we reshaped UTRs to modify gene locations).
Whe removed 0 genes because no more l2 were linked to them.
We found 0 cases where 2 genes have same location while CDS are differents. In that case we modified the gene locations by clipping UTRs.

Here's an example of one of the overlapping features from all three files:

# webin-CLI report
ERROR: "exon" Features locations are duplicated - consider merging qualifiers. [ line: 6951 of Pmacd_v0.10_ENAsubmit.embl.gz,  line: 6921 of Pmacd_v0.10_ENAsubmit.embl.gz]

# EMBLmyGFF3 output
#line: 6921-6924 of Pmacd_v0.10_ENAsubmit.embl.gz
FT   exon            complement(2756392..2756483)
FT                   /locus_tag="PMACD_LOCUS154"
FT                   /note="ID:PmacdG00000009802.2-exon5"
FT                   /note="source:AUGUSTUS"

#line: 6951-6954 of Pmacd_v0.10_ENAsubmit.embl.gz
FT   exon            complement(2756392..2756483)
FT                   /locus_tag="PMACD_LOCUS155"
FT                   /note="ID:PmacdG00000009803.1-exon2"
FT                   /note="source:GeneMark.hmm"

# gff3 input
# relevant exon is starred (**)
grep PmacdG00000009802 *nogeneID.gff
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    gene    2751271 2756483 .       -       .       ID=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    transcript      2751271 2752952 .       -       .       ID=PmacdG00000009802.1;Parent=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2751271 2751537 0       -       .       ID=PmacdG00000009802.1-exon1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2752178 2752342 0       -       .       ID=PmacdG00000009802.1-exon2;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2752767 2752952 0       -       .       ID=PmacdG00000009802.1-exon3;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2751271 2751537 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2752178 2752342 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2752767 2752952 .       -       0       ID=PmacdG00000009802.1-cds1;Parent=PmacdG00000009802.1
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        transcript      2751271 2756483 .       -       .       ID=PmacdG00000009802.2;Parent=PmacdG00000009802
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2751271 2751537 .       -       .       ID=PmacdG00000009802.2-exon1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2752178 2752342 .       -       .       ID=PmacdG00000009802.2-exon2;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2752767 2752930 .       -       .       ID=PmacdG00000009802.2-exon3;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2755781 2755980 .       -       .       ID=PmacdG00000009802.2-exon4;Parent=PmacdG00000009802.2
** Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        exon    2756392 2756483 .       -       .       ID=PmacdG00000009802.2-exon5;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2751271 2751537 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2752178 2752342 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2752767 2752930 .       -       2       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2755781 2755980 .       -       1       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2
Pmacd_v0.10_Sc0000000_pilon     AUGUSTUS        CDS     2756392 2756483 .       -       0       ID=PmacdG00000009802.2-cds1;Parent=PmacdG00000009802.2

# relevant exon is starred (**)
grep PmacdG00000009803 *nogeneID.gff
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    gene    2755740 2756483 .       -       .       ID=PmacdG00000009803
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    transcript      2755740 2756483 .       -       .       ID=PmacdG00000009803.1;Parent=PmacdG00000009803
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2755740 2755980 0       -       .       ID=PmacdG00000009803.1-exon1;Parent=PmacdG00000009803.1
** Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    exon    2756392 2756483 0       -       .       ID=PmacdG00000009803.1-exon2;Parent=PmacdG00000009803.1
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2755740 2755980 .       -       1       ID=PmacdG00000009803.1-cds1;Parent=PmacdG00000009803.1 
Pmacd_v0.10_Sc0000000_pilon     GeneMark.hmm    CDS     2756392 2756483 .       -       0       ID=PmacdG00000009803.1-cds1;Parent=PmacdG00000009803.1

Do you have any other suggestions of how to fix this error so I can validate and submit my flat file?

Thanks!
Rachel
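The two reported exons belong to different genes but share the same coordinates and strand, which a check restricted to isoforms of one gene would not flag. A small standard-library sketch for listing such cases in the GFF3 before conversion follows. The function name is hypothetical, and the rows are taken from the excerpt above.

```python
from collections import defaultdict

def duplicated_locations(gff_lines, feature_type="exon"):
    """Return {(seqid, start, end, strand): [IDs]} for locations used more than once."""
    seen = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        seen[key].append(attrs.get("ID"))
    return {key: ids for key, ids in seen.items() if len(ids) > 1}

seqid = "Pmacd_v0.10_Sc0000000_pilon"
rows = [
    (seqid, "AUGUSTUS", "exon", "2755781", "2755980", ".", "-", ".",
     "ID=PmacdG00000009802.2-exon4;Parent=PmacdG00000009802.2"),
    (seqid, "AUGUSTUS", "exon", "2756392", "2756483", ".", "-", ".",
     "ID=PmacdG00000009802.2-exon5;Parent=PmacdG00000009802.2"),
    (seqid, "GeneMark.hmm", "exon", "2756392", "2756483", "0", "-", ".",
     "ID=PmacdG00000009803.1-exon2;Parent=PmacdG00000009803.1"),
]
duplicates = duplicated_locations("\t".join(row) for row in rows)
print(duplicates)
```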

Annotation description is missing in output file

Hi,
I really like the tool, but there is a small problem when I convert my .gff file to an .embl file: I cannot see the annotation information in the output file. The rest seems okay. I am copying a subset of my data as text files in this email for your review. Can you please comment?

subset_embl.txt

annotated gff3.txt

Issue with translating genes on complement strand

Hello,

I've come across an issue with how CDS features are printed for genes encoded on the complementary strand. The problem manifests itself clearly when using the --translate flag, as it produces lots of erroneous translations riddled with stop codons *.

I give an example below.

The EMBL output for an affected gene looks like:

FT   gene            complement(123273..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.g00007"
FT   mRNA            complement(join(128366..128445,126919..127115,124188..124
FT                   406,123273..123484))
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007"
FT   exon            complement(128366..128445)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E1"
FT   exon            complement(126919..127115)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E2"
FT   exon            complement(124188..124406)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E3"
FT   exon            complement(123273..123484)
FT                   /locus_tag="BANY_locus6"
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-E4"
FT   CDS             complement(join(<128366..128445,126919..127115,124188..12
FT                   4406,123273..>123484))
FT                   /locus_tag="BANY_locus6"
FT                   /codon_start=1
FT                   /note="source:GenomeHubs"
FT                   /note="ID:BANY.1.2.t00007-CDS"
FT                   /translation="QKFI*SNIWC*HLVIRS*TTNALTLVCVTFSACRRGSSIRCRVVS
FT                   LHVAAALSSRAMEIPPRAMTTPL*VSS*QTNMDRE*RASNDRHTVVQRNVWRTCEDRKI
FT                   DS*RRNSNRKRLSV*GRCR*CCF*MWFR*L**MGSSYKL*FGEKCEIIKISKPIKSHWA
FT                   KENNLNLNELLSDGEYKELYRLAMIKWSEDMREKDYGCFCRAACENDVSTSNFTVQR*E
FT                   KVWQRFFN*SLKRK"
FT                   /transl_table=1

The mRNA feature looks fine, but there are some puzzling < and > characters in the CDS feature that I think may be the problem. The translation is then subsequently messed up, and in fact appears to be the translation for the exons in reverse order, as QKFI* corresponds to the first 4 "codons" of the last exon (E4, 123273..123484).

Hopefully an easy issue, and thanks for a great tool, this is going to be extremely useful :-)

Or maybe something funny in the GFF? the entry for this gene is:

BANY00001       GenomeHubs      gene    123273  128445  .       -       .       ID=BANY.1.2.g00007
BANY00001       GenomeHubs      mRNA    123273  128445  .       -       .       ID=BANY.1.2.t00007;Parent=BANY.1.2.g00007
BANY00001       GenomeHubs      exon    128366  128445  .       -       .       ID=BANY.1.2.t00007-E1;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    126919  127115  .       -       .       ID=BANY.1.2.t00007-E2;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    124188  124406  .       -       .       ID=BANY.1.2.t00007-E3;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      exon    123273  123484  .       -       .       ID=BANY.1.2.t00007-E4;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     128366  128445  .       -       0       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     126919  127115  .       -       2       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     124188  124406  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007
BANY00001       GenomeHubs      CDS     123273  123484  .       -       1       ID=BANY.1.2.t00007-CDS;Parent=BANY.1.2.t00007

Running biopython version: 1.67 and bcbio-gff version: 0.6.4
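The symptom described (a translation that starts with the last exon) is what results when minus-strand segments are concatenated in the order they are listed and then reverse-complemented. The toy sketch below sorts the segments into ascending genomic order first. The function name, the 14-base genome and its coordinates are invented for illustration and do not come from the tool.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def spliced_cds(genome, segments, strand):
    """Return the coding sequence for 1-based inclusive (start, end) segments."""
    ordered = sorted(segments)                      # ascending genomic order
    seq = "".join(genome[start - 1:end] for start, end in ordered)
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]       # reverse complement of the whole join
    return seq

# Toy minus-strand gene: coding sequence ATGGCC split over two segments.
genome = "TTGGCAAAACATTT"
listed_5_prime_first = [(10, 12), (3, 5)]           # order used in the GFF above
cds = spliced_cds(genome, listed_5_prime_first, "-")
print(cds)
```

Skipping the sort in this toy case would yield GCCATG, that is, the two coding segments in swapped order.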

Convert embl to gbk

Hi Jacques
Could you please add some functions to convert embl to gbk?

Best
Edison

string index out of range (when sequence ends with Ns)

Original question from @Iseez
Just one more question: when I was trying to obtain the EMBL file for a different species, I encountered the following error:

Traceback (most recent call last):                                             ]
  File "/cm/shared/apps/emblmygff3/1.2.6/bin/EMBLmyGFF3", line 11, in <module>
    load_entry_point('EMBLmyGFF3==1.2.6', 'console_scripts', 'EMBLmyGFF3')()
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1383, in main
    writer.write_all( outfile )
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 1179, in write_all
    self._add_mandatory()
  File "/cm/shared/apps/emblmygff3/1.2.6/lib/python2.7/site-packages/EMBLmyGFF3-1.2.6-py2.7.egg/EMBLmyGFF3/EMBLmyGFF3.py", line 195, in _add_mandatory
    if seq[end] == 'n' :
IndexError: string index out of range

Is the problem due to the files I'm using as input?
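The traceback shows an index being read without a bounds check. The surrounding code is not visible here, so the sketch below only illustrates a bounds-safe way to locate a trailing run of Ns. The function name is hypothetical.

```python
def trailing_n_run(seq):
    """Return the 0-based index where the final run of N/n starts (len(seq) if none)."""
    end = len(seq)
    # Check the bound before indexing so an all-N or empty sequence cannot overflow.
    while end > 0 and seq[end - 1] in "nN":
        end -= 1
    return end

for example in ("ACGTNNN", "ACGT", "NNNN", ""):
    print(repr(example), trailing_n_run(example))
```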

Features locations are duplicated - consider merging qualifiers

Thanks for this nice tool. I'm running into an issue trying to validate EMBL files that were generated with your tool. I'm using webin-cli-1.7.1 and it throws the error below when I try to validate/submit the EMBL files:

ERROR: "tRNA" Features locations are duplicated - consider merging qualifiers.

The command-line I used is this:

EMBLmyGFF3 test/6666666.419437.gff test/6666666.419437.contigs.fa -o test/test_new.embl

Any help in this regard would be highly appreciated

Installation inconveniences

Hello all,

thank you for your tool - very useful in dealing with EMBL/GBK/GFF3 formatting nightmare.

I've recommended it to several people who need to do submissions of annotated genomes. One thing that makes it hard to use is the fact that most people use bioconda now - thus both python and pip are installed via conda. This breaks EMBLmyGFF3 - neither installation command specified in the readme provides a working script. If you could make it easier to install in a cluster environment, it would be great.

Is the option --locus_numbering_start working?

Hi,

It seems to me that the --locus_numbering_start parameter is not working.
I provide an integer for this parameter (e.g. --locus_numbering_start 30) but it is not taken into consideration.
Could this be related to a 10 step increment in my gff3 input file coming from Prokka?
Thanks in advance, best

Feature Request: locus_tag parameter

Hi,

thank you for this great tool! It is of great help!

I would like to ask if it would be possible to introduce a new parameter with which the tool uses an already existing attribute as locus_tag, e.g. the ID or just locus_tag itself.
Thank You
Best Regards
Nadine

db_xref: tool crashes when one db is not accepted

We should simply not report a db_xref that is not accepted.