Comments (2)
Debugging revealed that the problem lies with get_gff3_features in biocodegff.py. There are two problems with this function:
- It interprets any GFF line starting with "##FASTA" as the beginning of the GFF3 FASTA section. I don't think this is correct, at least based on my reading of the GFF3 spec. I'd say that "##FASTA" (possibly with trailing whitespace) is the only thing that should be allowed on the line.
- It fails to raise an error when it sees (what it thinks is) FASTA sequence data that is not preceded by a valid FASTA defline. This is actually the more serious of the two errors, and is the one that allows thousands of input genes to be silently dropped.
The problem in this particular case is that MetaGeneMark prints the protein sequences in comments in the GFF, like this:
##Protein 11140
##MAQFRVDSEQIQQAAAAVGTSVSAIRDAVNGMYTNLQQLQSVWTGSAATQFASTAQQWRA
##AQQQMEQSLEAIQQAMQHASGVYLDAEAQATSLFGMG
##end-Protein
convert_metagenemark_gff_to_gff3.py echoes these comments to the GFF3 file unchanged, which it should not do, because "FASTA" is itself a valid amino acid sequence. All it takes is for "FASTA" to appear at the beginning of any protein line, and biocodegff.py will ignore the entire rest of the file without printing any errors or warnings:
##Protein 10298
##MRMQKVQKKLSETSFQDRLDFAATHSKTSVLRMCNSQCTGLCARDVLRARARFGSNALER
##KKQNSLASRLVQAFINPFSCILFVLALISCINDMVLPSLSLLGQSPDDFDCTTFTIITTM
##ITVSGILRFVQESKSANAAQKLMDMVRTTVSCLRDGDADEDAVSPSTSATASPSASASLA
##NFSFEDKAKLTEIQLDSLVVGDIVYLSTGDIVPADVRILSACDLFVNEASLTGESELVEK
##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR
<everything from this point on, except FASTA sequence, is ignored by the parser>
##DINSLSWVLIRFMIVMVPVVLAINGFTKGDW
##end-Protein
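To make the failure mode concrete, here is a minimal sketch of a prefix-based check like the one described above. It is a simplified stand-in, not the actual get_gff3_features code; the function name count_features_loose and the sample lines are made up for illustration.

```python
def count_features_loose(lines):
    """Count 9-column feature lines, stopping at any line that starts with '##FASTA'.

    Hypothetical stand-in for a parser that uses a prefix match to detect
    the FASTA section; it is not biocode's real implementation.
    """
    count = 0
    for line in lines:
        if line.startswith("##FASTA"):
            break  # everything after this point is treated as sequence data
        if line.startswith("#"):
            continue  # other comments and directives are skipped
        if len(line.rstrip("\n").split("\t")) == 9:
            count += 1
    return count


# Illustrative input: two genes, with an echoed protein comment in between
# whose sequence happens to begin with "FASTA".
gff_lines = [
    "##gff-version 3",
    "contig1\tGeneMark.hmm\tgene\t1\t300\t.\t+\t.\tID=gene1",
    "##Protein 1",
    "##FASTATKAANICDYENLAFMG",
    "##end-Protein",
    "contig1\tGeneMark.hmm\tgene\t400\t900\t.\t-\t.\tID=gene2",
]

print(count_features_loose(gff_lines))  # gene2 is never counted
```

The second gene is dropped without any error, which matches the silent data loss seen with the MetaGeneMark output.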
I have a partial proposed solution, which consists of:
- Requiring that "##FASTA" be the only thing on the line. Note that this doesn't solve the MetaGeneMark bug; it just makes it less likely to be triggered.
- Throwing an exception when what the parser thinks is the FASTA section is not well-formed. In particular, all FASTA sequence must be preceded by a FASTA defline.
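The two changes above could look roughly like the following. This is only a sketch under my own assumptions, not a patch against biocodegff.py: split_gff3, MalformedFastaError, and FASTA_DIRECTIVE are hypothetical names, and the real function builds feature objects rather than returning raw lines.

```python
import re

# "##FASTA" must be the whole line, optionally followed by whitespace.
FASTA_DIRECTIVE = re.compile(r"^##FASTA\s*$")


class MalformedFastaError(ValueError):
    """Raised when the FASTA section of a GFF3 file is not well-formed."""


def split_gff3(lines):
    """Split GFF3 lines into (feature_lines, sequences).

    Raises MalformedFastaError if sequence data appears in the FASTA
    section before any defline. Hypothetical helper for illustration.
    """
    features = []
    sequences = {}
    in_fasta = False
    current_id = None

    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")

        if not in_fasta:
            if FASTA_DIRECTIVE.match(line):
                in_fasta = True
            elif line and not line.startswith("#"):
                features.append(line)
            continue

        if not line.strip():
            continue  # tolerate blank lines inside the FASTA section
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise MalformedFastaError(f"line {lineno}: empty FASTA defline")
            current_id = header[0]
            sequences[current_id] = []
        elif current_id is None:
            raise MalformedFastaError(
                f"line {lineno}: sequence data before any FASTA defline"
            )
        else:
            sequences[current_id].append(line.strip())

    return features, {seq_id: "".join(parts) for seq_id, parts in sequences.items()}


example = [
    "##gff-version 3",
    "c1\t.\tgene\t1\t9\t.\t+\t.\tID=g1",
    "##FASTATKAANIC",  # echoed protein comment, no longer mistaken for the directive
    "c1\t.\tgene\t20\t30\t.\t+\t.\tID=g2",
    "##FASTA  ",       # real directive, trailing whitespace allowed
    ">c1 example contig",
    "ACGT",
    "ACGT",
]
features, sequences = split_gff3(example)
print(len(features), sequences)
```

With the strict match, both genes survive the echoed protein comment. If the old prefix match were kept, the second check would at least turn the silent drop into an error, since the line after "##FASTATKAANIC" is not a defline.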
Finally, I'll file this as a separate issue, but I think convert_metagenemark_gff_to_gff3.py should be modified so that it doesn't produce GFF3 that's ambiguous or malformed, i.e., no "##FASTA" line unless it's at the start of a valid FASTA section. One possibility might be to tack an extra "#" onto all the echoed comment lines.
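As a rough illustration of the extra-"#" idea, assuming the converter has a single place where it echoes MetaGeneMark comment lines (escape_echoed_comment is a hypothetical name, not an existing function in the script):

```python
def escape_echoed_comment(line):
    """Prepend one '#' to an echoed comment line.

    Hypothetical helper: meant only for comments copied from the
    MetaGeneMark input, not for directives the converter writes itself
    (such as "##gff-version 3").
    """
    return "#" + line if line.startswith("#") else line


escaped = escape_echoed_comment("##FASTATKAANICDYENLAFMG")
print(escaped)
```

After escaping, the protein line no longer begins with "##FASTA", so even a prefix-based check would not mistake it for the start of the FASTA section.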
Closed this prematurely!