Comments (5)
I just added a quick and dirty try/except on all the locus_tag getters and the conversion went fine and skipped the features that didn't have this qualifier.
Something like: (~ line 122 in the file)
elif feat.type == 'tRNA':
try:
locus_tag = feat.qualifiers['locus_tag'][0]
except:
print("WARNING: The following feature was skipped due to lack of locus_tag:\n{0}".format(feat))
features_skipped_count += 1
Maybe an if statement would have been prettier tho...
from biocode.
Good catch. Would you like to submit a pull request for attribution?
from biocode.
Sure thing, I'll submit tomorrow
from biocode.
Just a quick glance over the genbank file that lead to the error (for readability I removed some useless informations but you can find the whole file here):
On this first example, you can see that the tRNA feature does not have the locus_tag
and that they are located outside the gene.
gene 83175..83459
/gene="Z254R"
/locus_tag="ATCV1_Z254R"
/db_xref="GeneID:5470877"
CDS 83175..83459
/gene="Z254R"
/locus_tag="ATCV1_Z254R"
[...]
misc_feature <83208..>83435
/gene="Z254R"
/locus_tag="ATCV1_Z254R"
[...]
tRNA 83727..83793
/product="tRNA-Ser"
tRNA 83799..83872
/product="tRNA-Arg"
[...]
tRNA 84456..84528
/product="tRNA-Lys"
On this second example, we have the tRNA feature without a locus_tag
but, its coordinates are within the range of the previous gene:
gene complement(84467..84664)
/gene="Z255L"
/locus_tag="ATCV1_Z255L"
/db_xref="GeneID:5470549"
CDS complement(84467..84664)
/gene="Z255L"
/locus_tag="ATCV1_Z255L"
/codon_start=1
/product="hypothetical protein"
/protein_id="YP_001426736.1"
/db_xref="GeneID:5470549"
/translation="MTSSLPLRLESADSALIGIEPMTPWLTAKCSNRLSYSALHLKSA
DLPDVGLEPTTTRLRALRSAD"
tRNA 84551..84624
/product="tRNA-Asn"
And finally we have the best behavior possible, the one that you already covered: both the locus_tag
and within the range:
gene 84655..84966
/gene="Z256R"
/locus_tag="ATCV1_Z256R"
/db_xref="GeneID:5470266"
CDS 84655..84966
/gene="Z256R"
/locus_tag="ATCV1_Z256R"
/codon_start=1
/product="hypothetical protein"
/protein_id="YP_001426737.1"
/db_xref="GeneID:5470266"
/translation="MKSFVILRSPVRSWLGPKYQGPIAQLGERKTEVFGISPIRQHRL
VRSRPGVLRTSPLMRAWVRIPLLPFLIYHICAVSNPNQLCNHSIKNAQKQTYEVLGEK
R"
tRNA 84773..84858
/gene="Z256R"
/locus_tag="ATCV1_Z256R"
/product="tRNA-Leu"
/db_xref="GeneID:5470266"
Now, knowing that, what behavior do you prefer between those two ?
1st: If the locus_tag
qualifier is absent, just skip the whole feature. I've tried it only for tRNA but I apply it to the mRNA and rRNA aswell pretty easily.
elif feat.type == 'tRNA':
try:
locus_tag = feat.qualifiers['locus_tag'][0]
except KeyError:
print("WARNING: The following feature was skipped due to missing locus_tag qualifier:\n{0}".format(feat))
features_skipped_count += 1
continue
2nd: Or I've tried a second version which is checking if the feature is located inside the current_gene
, if it is then we'll add it, and if it's not we'll skip it:
elif feat.type == 'tRNA':
try:
locus_tag = feat.qualifiers['locus_tag'][0]
except KeyError:
# If the feature does not have a locus_tag, we can check if
# it's inside the current_gene or not. If it is then we can
# add it using the last locus_tag in memory. If it isn't we
# can skip the feature.
pass
rna_count_by_gene[locus_tag] += 1
feat_id = "{0}.tRNA.{1}".format( locus_tag, rna_count_by_gene[locus_tag] )
tRNA = things.tRNA(id=feat_id, parent=current_gene)
tRNA.locate_on( target=current_assembly, fmin=fmin, fmax=fmax, strand=strand )
if not tRNA.contained_within(current_gene):
print("WARNING: The following feature was skipped:\n{0}".format(feat))
features_skipped_count += 1
rna_count_by_gene[locus_tag] -= 1 # remove the feature from the count since we drop it
continue
gene.add_tRNA(tRNA)
current_RNA = tRNA
...
In this solution I'm using the last locus_tag
in memory which should be the same as the current gene.
from biocode.
Well isn't this ugly. According to the Genbank flat file spec for these features:
RNA features (rRNA, tRNA, ncRNA) must include a corresponding gene feature
with a locus_tag qualifier.
In their examples the gene and ncRNA feature coordinates match. I'd rather annotations not be lost in the process, so the safest might be to print warnings anytime the gene and ncRNA coordinates don't match, then save the gene/ncRNA pairing if they overlap or if there are no other features the previous gene id is associated with.
The more I look though, the uglier this gets. That full annotation file you linked has an entire block of tRNAs with no associated genes:
tRNA 83727..83793
/product="tRNA-Ser"
tRNA 83799..83872
/product="tRNA-Arg"
tRNA 83898..83968
/product="tRNA-Gly"
tRNA 83991..84062
/product="tRNA-Asp"
tRNA 84086..84158
/product="tRNA-Val"
tRNA 84181..84253
/product="tRNA-Val"
tRNA 84276..84349
/product="tRNA-Asn"
tRNA 84372..84453
/product="tRNA-Tyr"
tRNA 84456..84528
/product="tRNA-Lys"
Also, the one you mention that I already had covered 'ATCV1_Z256R' is actually a protein coding gene whose coordinates happen to overlap their annotated tRNA. They don't have anything to do with each other, so the current version would actually erroneously assign the locus tag to that tRNA and it shouldn't have.
The strict solution here would be erroring if there's an ncRNA feature without matching coordinates on a gene feature, and referencing the NCBI spec.
The little bit nicer one is to use the locus_tag for an ncRNA feature only if the coordinates match, else export the tRNA/gene stack to GFF3 without a locus tag for these free-floating annotations.
Happy to hear opinions.
from biocode.
Related Issues (20)
- convert_gff3_to_ncbi_tbl HOT 5
- Syntax error on gff.py HOT 4
- Exclude mRNA features in bacterial TBL exports
- Attribute error for update_selected_column9_values.py HOT 1
- Biocode.gff module error HOT 2
- AttributeError: type object 'str' has no attribute 'maketrans' HOT 2
- AttributeError: 'Gene' object has no attribute 'add_CDS' HOT 4
- Insert EC numbers into chado database issue HOT 5
- convert_augustus_to_gff3.py error HOT 6
- Conda based install HOT 2
- convert_gff3_to_ncbi_tbl.py HOT 2
- convert_gff_to_ncbi_tbl.py HOT 3
- Incorrect parent features from convert_tRNAScanSE_to_gff3.pl HOT 2
- Formatting Issue? HOT 2
- biocode error HOT 16
- product info not printout in tbl HOT 2
- fasta/fasta_simple_stats.py fails on any file with only one sequence
- [convert_genbank_to_gff3.py] No Locus_tag present in my genbank file HOT 2
- Python upgrade/conversion? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from biocode.