foerstner-lab / gffpandas
Parse GFF3 into Pandas dataframes
License: ISC License
I have a GFF file containing a section starting with "##FASTA". gffpd.read_gff3 raises an error on it.
When I delete everything from "##FASTA" onwards, gffpd.read_gff3 works, so this looks like an easy fix.
Thank you for this code!
The file that works:
##gff-version 3
##source-version geneious 2020.2.2
##sequence-region pZA21RVapCmCherry 1 4048
pZA21RVapCmCherry Geneious region 1 4048 . + 0 Is_circular=true
pZA21RVapCmCherry Geneious exon 2327 3046 . - . Name=tetR
pZA21RVapCmCherry Geneious exon 3043 3837 . - . Name=KanR
pZA21RVapCmCherry Geneious exon 109 510 . + . Name=vapC
pZA21RVapCmCherry Geneious exon 494 1237 . + . Name=mCherry
The file that does not work:
`##gff-version 3
##source-version geneious 2020.2.2
##sequence-region pZA21RVapCmCherry 1 4048
pZA21RVapCmCherry Geneious region 1 4048 . + 0 Is_circular=true
pZA21RVapCmCherry Geneious exon 2327 3046 . - . Name=tetR
pZA21RVapCmCherry Geneious exon 3043 3837 . - . Name=KanR
pZA21RVapCmCherry Geneious exon 109 510 . + . Name=vapC
pZA21RVapCmCherry Geneious exon 494 1237 . + . Name=mCherry
##FASTA
pZA21RVapCmCherry
CTCGAGTCCCTATCAGTGATAGAGATTGACATCCCTATCA
GTGATAGAGATACTGAGCACATCAGCAGGACGCACTGACC
GAATTCGACATATCCACATAAGGAGGCACTGATGCTGAAG
TTTATGCTCGATACCAACATCTGCATTTTTACGATAAAGA
ACAAACCCGCCAGTGTCAGGGAACGTTTTAACCTGAACCA
GGGGAGAATGTGCATCAGTTCGGTCACTCTGATGGAGGTG
ATATATGGTGCAGAAAAAAGCCAGATGCCTGAACGTAATC
TCGCTGTGATCGAGGGATTTGTTTCCCGCATTGACGTTCT
GGATTACGACGCTGCTGCTGCCACACACACCGGCCAGATA
AGAGCAGAACTTGCCCTTCAGGGACGCCCTGTCGGGCCAT
TTGATCAAATGATCGCAGGTCATGCCCGCAGTCGGGGACT
GATTATTGTGACTAATAACACCCGGGAATTTGAACGTGTG
GGCGGCCTGAGAATTGAAGACTGGAGTTGACCTGTTAGGA
GGTACCATGGTGAGCAAGGGCGAGGAGGATAACATGGCCA
TCATCAAGGAGTTCATGCGCTTCAAGGTGCACATGGAGGG
CTCCGTGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGC
GAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGA
AGGTGACCAAGGGTGGCCCCCTGCCCTTCGCCTGGGACAT
CCTGTCCCCTCAGTTCATGTACGGCTCCAAGGCCTACGTG
AAGCACCCCGCCGACATCCCCGACTACTTGAAGCTGTCCT
TCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGA
GGACGGCGGCGTGGTGACCGTGACCCAGGACTCCTCCCTG
CAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCA
CCAACTTCCCCTCCGACGGCCCCGTAATGCAGAAGAAGAC
CATGGGCTGGGAGGCCTCCTCCGAGCGGATGTACCCCGAG
GACGGCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGC
TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCAC
CTACAAGGCCAAGAAGCCCGTGCAGCTGCCCGGCGCCTAC
AACGTCAACATCAAGTTGGACATCACCTCCCACAACGAGG
ACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCG
CCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAAAAG
CTTAATTAGCTGAGTCTAGAGGCATCAAATAAAACGAAAG
GCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTT
TGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGCC
CTAGACCTAGGGGATATATTCCGCTTCCTCGCTCACTGAC
TCGCTACGCTCGGTCGTTCGACTGCGGCGAGCGGAAATGG
CTTACGAACGGGGCGGAGATTTCCTGGAAGATGCCAGGAA
GATACTTAACAGGGAAGTGAGAGGGCCGCGGCAAAGCCGT
TTTTCCATAGGCTCCGCCCCCCTGACAAGCATCACGAAAT
CTGACGCTCAAATCAGTGGTGGCGAAACCCGACAGGACTA
TAAAGATACCAGGCGTTTCCCCCTGGCGGCTCCCTCGTGC
GCTCTCCTGTTCCTGCCTTTCGGTTTACCGGTGTCATTCC
GCTGTTATGGCCGCGTTTGTCTCATTCCACGCCTGACACT
CAGTTCCGGGTAGGCAGTTCGCTCCAAGCTGGACTGTATG
CACGAACCCCCCGTTCAGTCCGACCGCTGCGCCTTATCCG
GTAACTATCGTCTTGAGTCCAACCCGGAAAGACATGCAAA
AGCACCACTGGCAGCAGCCACTGGTAATTGATTTAGAGGA
GTTAGTCTTGAAGTCATGCGCCGGTTAAGGCTAAACTGAA
AGGACAAGTTTTGGTGACTGCGCTCCTCCAAGCCAGTTAC
CTCGGTTCAAAGAGTTGGTAGCTCAGAGAACCTTCGAAAA
ACCGCCCTGCAAGGCGGTTTTTTCGTTTTCAGAGCAAGAG
ATTACGCGCAGACCAAAACGATCTCAAGAAGATCATCTTA
TTAATCAGATAAAATATTTCTAGATTTCAGTGCAATTTAT
CTCTTCAAATGTAGCACCTGAAGTCAGCCCCATACGATAT
AAGTTGTTACTAGTGCTTGGATTCTCACCAATAAAAAACG
CCCGGCGGCAACCGAGCGTTCTGAACAAATCCAGATGGAG
TTCTGAGGTCATTACTGGATCTATCAACAGGAGTCCAAGC
GAGCTCTAGCTCTAGGCTACTCAGCTATCTAGAAAGCTTA
AGATCCTTAAGACCCACTTTCACATTTAAGTTGTTTTTCT
AATCCGCATATGATCAATTCAAGGCCGAATAAGAAGGCTG
GCTCTGCACCTTGGTGATCAAATAATTCGATAGCTTGTCG
TAATAATGGCGGCATACTATCAGTAGTAGGTGTTTCCCTT
TCTTCTTTAGCGACTTGATGCTCTTGATCTTCCAATACGC
AACCTAAAGTAAAATGCCCCACAGCGCTGAGTGCATATAA
TGCATTCTCTAGTGAAAAACCTTGTTGGCATAAAAAGGCT
AATTGATTTTCGAGAGTTTCATACTGTTTTTCTGTAGGCC
GTGTACCTAAATGTACTTTTGCTCCATCGCGATGACTTAG
TAAAGCACATCTAAAACTTTTAGCCTTATTACGTAAAAAA
TCTTGCCAGCTTTCCCCTTCTAAAGGGCAAAAGTGAGTAT
GGTGCCTATCTAACATCTCAATGGCTAAGGCGTCGAGCAA
AGCCCGCTTATTTTTTACATGCCAATACAATGTAGGCTGC
TCTACACCTAGCTTCTGGGCGAGTTTACGGGTTGTTAAAC
CTTCGATTCCGACCTCATTAAGCAGCTCTAATGCGCTGTT
AATCACTTTACTTTTATCTAATCTAGACATATGAATTCGG
GGCGGGATTTCATGGATATGTTTCTTTCTGCGAGAACCAG
CCATATCAGTACCTCCTGAGCTCTCGAACCCCAGAGTCCC
GCTCAGAAGAACTCGTCAAGAAGGCGATAGAAGGCGATGC
GCTGCGAATCGGGAGCGGCGATACCGTAAAGCACGAGGAA
GCGGTCAGCCCATTCGCCGCCAAGCTCTTCAGCAATATCA
CGGGTAGCCAACGCTATGTCCTGATAGCGGTCCGCCACAC
CCAGCCGGCCACAGTCGATGAATCCAGAAAAGCGGCCATT
TTCCACCATGATATTCGGCAAGCAGGCATCGCCATGGGTC
ACGACGAGATCCTCGCCGTCGGGCATGCGCGCCTTGAGCC
TGGCGAACAGTTCGGCTGGCGCGAGCCCCTGATGCTCTTC
GTCCAGATCATCCTGATCGACAAGACCGGCTTCCATCCGA
GTACGTGCTCGCTCGATGCGATGTTTCGCTTGGTGGTCGA
ATGGGCAGGTAGCCGGATCAAGCGTATGCAGCCGCCGCAT
TGCATCAGCCATGATGGATACTTTCTCGGCAGGAGCAAGG
TGAGATGACAGGAGATCCTGCCCCGGCACTTCGCCCAATA
GCAGCCAGTCCCTTCCCGCTTCAGTGACAACGTCGAGCAC
AGCTGCGCAAGGAACGCCCGTCGTGGCCAGCCACGATAGC
CGCGCTGCCTCGTCCTGCAGTTCATTCAGGGCACCGGACA
GGTCGGTCTTGACAAAAAGAACCGGGCGCCCCTGCGCTGA
CAGCCGGAACACGGCGGCATCAGAGCAGCCGATTGTCTGT
TGTGCCCAGTCATAGCCGAATAGCCTCTCCACCCAAGCGG
CCGGAGAACCTGCGTGCAATCCATCTTGTTCAATCATGCG
AAACGATCCTCATCCTGTCTCTTGATCAGATCTTGATCCC
CTGCGCCATCAGATCCTTGGCGGCAAGAAAGCCATCCAGT
TTACTTTGCAGGGCTTCCCAACCTTACCAGAGGGCGCCCC
AGCTGGCAATTCCGACGTCTAAGAAACCATTATTATCATG
ACATTAACCTATAAAAATAGGCGTATCACGAGGCCCTTTC
GTCTTCAC
working.gff.txt
not_working.gff.txt
`
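Until the reader handles this itself, one workaround is to cut the file at the "##FASTA" directive before parsing. The following is a minimal sketch using only the standard library; `split_gff3_and_fasta` is a hypothetical helper name, not part of gffpandas.

```python
def split_gff3_and_fasta(text):
    """Return (annotation_part, fasta_part) of a GFF3 document.

    Everything after the first line that is exactly '##FASTA' is treated
    as sequence data. fasta_part is '' when there is no such line.
    """
    annotation_lines = []
    fasta_lines = []
    in_fasta = False
    for line in text.splitlines(keepends=True):
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue  # drop the directive line itself
        if in_fasta:
            fasta_lines.append(line)
        else:
            annotation_lines.append(line)
    return "".join(annotation_lines), "".join(fasta_lines)
```

The annotation part can then be written to a temporary file and that path passed to gffpd.read_gff3, which takes a file path according to the traceback further down this page.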
Hi,
Are there any examples of how to change IDs in a GFF3 file and write it back to a file?
Thank you in advance,
Michal
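One possible approach is to rewrite the attributes string of each row. The sketch below is plain Python and does not call gffpandas; `rename_ids` and `id_map` are hypothetical names chosen for illustration. It renames both ID= and Parent= values so that parent/child links stay consistent.

```python
def rename_ids(attributes, id_map):
    """Rewrite ID= and Parent= values in one GFF3 attributes string.

    id_map maps old identifiers to new ones. Identifiers that are not
    in the map are left unchanged.
    """
    renamed = []
    for pair in attributes.split(";"):
        if not pair:
            continue  # skip empty fields, e.g. after a trailing ';'
        if "=" not in pair:
            renamed.append(pair)
            continue
        key, value = pair.split("=", 1)
        if key in ("ID", "Parent"):
            # Parent may list several comma-separated identifiers.
            value = ",".join(id_map.get(v, v) for v in value.split(","))
        renamed.append(f"{key}={value}")
    return ";".join(renamed)


print(rename_ids("ID=mRNA1;Parent=gene1,gene2;Name=x",
                 {"mRNA1": "tx001", "gene1": "g001"}))
```

With gffpandas this could probably be applied to the attributes column of annotation.df (for example with pandas' Series.map) and the result written out again, assuming the `attributes` column name and the `to_gff3` writer described in the gffpandas documentation.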
Parsing the GFF file GCF_000005845.2_ASM584v2_genomic.gff.gz fails.
The attributes column cannot be parsed because it contains an extra =.
The line that makes gffpandas crash looks like this:
NC_000913.3 RefSeq CDS 257829 257899 . + 0 ID=cds-gnl|b0240|CDS=288;Parent=gene-b0240;Dbxref=ASAP:ABE-0000821,ECOCYC:EG11092,EcoGene:EG11092,GeneID:945077;gbkey=CDS;gene=crl;locus_tag=b0240;orig_protein_id=gnl|b0240|CDS%3D288;orig_transcript_id=gnl|b0240|mrna.CDS%3D288;product=RNA polymerase holoenzyme assembly factor Crl;transl_table=11
More precisely, it is this part, which contains an extra =: ID=cds-gnl|b0240|CDS=288;
import gffpandas.gffpandas as gffpd
annotation = gffpd.read_gff3(genome_path)
annotation = annotation.attributes_to_columns()
This can be fixed by replacing key_value_pair.split('=') with key_value_pair.split('=', 1) in the function attributes_to_columns.
However, I'm not sure it is a good solution :)
I can create a PR for that if you agree.
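To illustrate the difference, here is a small standalone sketch (not the gffpandas implementation; `parse_attributes` is a hypothetical name). Splitting on the first = only keeps any further = inside the value.

```python
def parse_attributes(attributes):
    """Parse a GFF3 attributes string into a dict, splitting each
    key=value pair on the first '=' only."""
    pairs = {}
    for key_value_pair in attributes.split(";"):
        if "=" not in key_value_pair:
            continue  # ignore empty or malformed fields
        key, value = key_value_pair.split("=", 1)
        pairs[key] = value
    return pairs


field = "ID=cds-gnl|b0240|CDS=288"
print(field.split("="))     # three parts, so dict() cannot consume it
print(field.split("=", 1))  # two parts: key and full value
print(parse_attributes(field + ";Parent=gene-b0240;gbkey=CDS"))
```

Note that the GFF3 specification reserves = inside attribute values and expects it to be percent-encoded as %3D (as the orig_protein_id value in the same line does), so the tolerant split works around an off-spec file rather than changing how valid files are read.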
This would increase visibility and usability. I want to use gffpandas as part of a large Snakemake workflow, and the workflow becomes less redistributable if I can't.
I'm enabling reading of GFF files from a URL with decompression. Pandas can do this, and can also read from a buffer-like object, but _read_gff_header explicitly opens a file, thus requiring explicit downloads to a local file. I believe that files also get opened twice, once in the header read and a second time by pandas.
Please allow input of buffer objects.
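A rough sketch of what buffer support could look like: read the input once and separate header lines from feature lines in the same pass. This is an illustration only; `read_header_and_body` is a hypothetical helper, and the rule that header lines start with '#' is taken from the `_read_gff_header` source shown in the traceback further down this page.

```python
import io


def read_header_and_body(source):
    """Split GFF3 input into header text and a buffer of feature lines.

    `source` may be a path or any object with a read() method that
    returns str or bytes. The input is read exactly once.
    """
    if hasattr(source, "read"):
        text = source.read()
    else:
        with open(source, encoding="utf-8") as handle:
            text = handle.read()
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    header, body = [], []
    for line in text.splitlines(keepends=True):
        (header if line.startswith("#") else body).append(line)
    # The returned StringIO is file-like, so pandas.read_csv can read it.
    return "".join(header), io.StringIO("".join(body))
```

The second return value is a buffer-like object, so it can be handed to pandas without opening the original file a second time.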
Hi all,
The script below parses BLAST XML output and reads a GFF3 file with gffpandas. How is it possible in gffpandas to add 'Note=Gene description' if an mRNA's ID is in the BLAST output?
#!/usr/bin/python3
import click
from Bio.Blast import NCBIXML
import gffpandas.gffpandas as gffpd


class Hits():
    def __init__(self, hit_id, hit_def):
        self.hit_id = hit_id
        self.hit_def = hit_def


def retrieve_hits_data(blast_XML_file):
    # Map each BLAST query name to its best hit (ID and description).
    with open(blast_XML_file) as bf:
        blast_records = NCBIXML.parse(bf)
        hits_all = {}
        for blast_record in blast_records:
            query_name = blast_record.query
            try:
                hit_id = blast_record.alignments[0].hit_id.split('|')[1]
                hit_def = blast_record.alignments[0].hit_def.split("OS=")[0]
                hits_all[query_name] = Hits(hit_id, hit_def)
            except IndexError:
                # No alignments, or hit_id without a '|' separator.
                continue
    return hits_all


@click.command()
@click.option('--gff3', help="Provide GFF3 file", required=True)
@click.option('--keep', help="Keep GFF3 file", required=True)
@click.option('--reject', help="Reject GFF3 file", required=True)
@click.option('--xml', help="Blast XML file", required=True)
def run(gff3, keep, reject, xml):
    blast_hits = retrieve_hits_data(xml)
    # Read the file given via --gff3 (was hard-coded to
    # 'data/augustus.hints_utr.gff3').
    annotation = gffpd.read_gff3(gff3)


if __name__ == '__main__':
    run()
Thank you in advance,
Michal
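One way this might be done is to append a Note field to the attributes string of every feature whose ID has a description. The sketch below is plain Python with hypothetical names (`add_note`, `descriptions`) and makes no gffpandas calls. It percent-encodes characters that GFF3 reserves in attribute values.

```python
from urllib.parse import quote


def add_note(attributes, descriptions):
    """Append Note=<description> when the feature's ID is a key of
    `descriptions` (a dict mapping feature ID to description text)."""
    fields = [f for f in attributes.split(";") if f]
    feature_id = None
    for field in fields:
        if field.startswith("ID="):
            feature_id = field[3:]
            break
    if feature_id in descriptions:
        # Escape ';', '=', ',', '&' and '%', keep spaces readable.
        note = quote(descriptions[feature_id].strip(), safe=" :/()[]|+")
        fields.append(f"Note={note}")
    return ";".join(fields)


print(add_note("ID=mRNA1;Parent=g1", {"mRNA1": "Kinase, putative "}))
```

For the script above, `descriptions` could be built as {name: hit.hit_def for name, hit in blast_hits.items()}, assuming the BLAST query names match the mRNA IDs exactly. The function would then be mapped over the attributes column of the rows whose type is mRNA.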
I hope gffpandas can read .gz files, like pandas and gtfparser:
import gffpandas.gffpandas as gffpd
annotation = gffpd.read_gff3('gencode.v32.chr_patch_hapl_scaff.annotation.gff3.gz')
I also cannot find out the installed version. It is not in your docs, and I think it should be exposed as the magic variable __version__:
In [6]: gffpd.__version__
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 gffpd.__version__
AttributeError: module 'gffpandas.gffpandas' has no attribute '__version__'
It raises an exception when I read a .gz file:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[3], line 2
1 import gffpandas.gffpandas as gffpd
----> 2 annotation = gffpd.read_gff3('gencode.v32.chr_patch_hapl_scaff.annotation.gff3.gz')
File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:6, in read_gff3(input_file)
5 def read_gff3(input_file):
----> 6 return Gff3DataFrame(input_file)
File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:22, in Gff3DataFrame.__init__(self, input_gff_file, input_df, input_header)
20 self._gff_file = input_gff_file
21 self._read_gff3_to_df()
---> 22 self._read_gff_header()
23 else:
24 self.df = input_df
File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:44, in Gff3DataFrame._read_gff_header(self)
39 """Create a header.
40
41 The header of the gff file is read, means all lines,
42 which start with '#'."""
43 self.header = ''
---> 44 for line in open(self._gff_file):
45 if line.startswith('#'):
46 self.header += line
File ~/anaconda3/lib/python3.8/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
319 def decode(self, input, final=False):
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
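The byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the header reader is decoding compressed data as text. A minimal sketch of a header reader that detects gzip input follows; `open_maybe_gzip` and `read_header` are hypothetical helpers written for illustration, not the gffpandas code.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open `path` for text reading, decompressing if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_header(path):
    """Collect all lines starting with '#', closing the file afterwards."""
    header = ""
    with open_maybe_gzip(path) as handle:
        for line in handle:
            if line.startswith("#"):
                header += line
    return header
```

Checking the magic bytes rather than the file extension also covers compressed files that do not end in .gz.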
Running gffpandas over a badly-formed GFF doesn't give any indication of where the problem occurred. Line numbers in ValueError messages would be invaluable. To accomplish this, you'll have to refactor a bit. Consider adding a line-number column to the data frame for temporary internal-only use and doing conversions in df.iterrows() or df.groupby() using try/except rather than in apply(). apply() fails silently by design, which is good for some situations but not this one.
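As a rough illustration of the idea (plain Python rather than a data-frame refactor, with the hypothetical name `parse_gff3_rows`), each conversion is wrapped in try/except so the error message carries the line number:

```python
def parse_gff3_rows(lines):
    """Parse GFF3 feature lines, reporting the 1-based line number of
    the first malformed line in the ValueError message."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = stripped.split("\t")
        if len(columns) != 9:
            raise ValueError(
                f"line {line_number}: expected 9 tab-separated columns, "
                f"found {len(columns)}"
            )
        try:
            columns[3] = int(columns[3])  # start
            columns[4] = int(columns[4])  # end
        except ValueError as error:
            raise ValueError(f"line {line_number}: {error}") from error
        rows.append((line_number, columns))
    return rows
```

Keeping the line number next to each parsed row also makes it available for later checks, such as attribute parsing.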
I got the following failure when I read this GFF file:
File "gffpandas/gffpandas.py", line 133, in
lambda attributes: dict([key_value_pair.split('=') for
ValueError: dictionary update sequence element #6 has length 1; 2 is required
This lambda is one of three in gffpandas. In my opinion, all 3 should be
refactored into map() or stand-alone functions. Lambdas are hard to read
and even harder to make defensive.
This failure is because a handful of features in this file end with a ";". Yes, this is a somewhat rare thing, but I also see it on some NCBI gffs. Like many ugly things in GFF-land, I don't know if it's strictly valid, but plenty of GFFs that pass GFF validators (e.g. the one in genometools) have them. There's an additional (even more rare) problem that somebody can put an equal sign into the feature string.
Here's a function that can be substituted for the lambda expression that fixes the problem:
def _split_atts(atts):
    """Split a feature string into attributes."""
    splits_list = [a.split("=") for a in atts.split(";") if "=" in a]
    return {l[0]:"=".join(l[1:]) for l in splits_list}
In https://gffpandas.readthedocs.io/en/latest/background.html the link 'How to use gffpandas' points to a local file: file:///home/vivian/gffPandas/gffpandas/docs/build/html/tutorial.html
It's good practice to run tests with warnings turned on, which one accomplishes via warnings.filterwarnings("error").
When I do this with my code, I get a complaint that gffpandas doesn't close the gff file.
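The usual remedy is to open the file in a `with` block so it is closed deterministically. Below is a minimal sketch of a header reader written that way, plus a small checker for ResourceWarning. Both `read_header` and `resource_warnings_from` are hypothetical names for illustration, not the gffpandas code.

```python
import gc
import warnings


def read_header(path):
    """Collect all lines starting with '#'. The with-block closes the
    file even if an exception is raised while reading."""
    header = ""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                header += line
    return header


def resource_warnings_from(func, *args):
    """Call func(*args) and return any ResourceWarning it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func(*args)
        gc.collect()  # give unclosed files a chance to be finalized
    return [w for w in caught if issubclass(w.category, ResourceWarning)]
```

A test suite could assert that `resource_warnings_from(read_header, path)` returns an empty list.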
gffpandas has pytest-runner as a setup_requires hard dependency, meaning that setup will fail without pytest-runner installed, even though pytest-runner is only needed for testing. gffpandas shares this unusual feature with code like screed, which points to a common ancestor.
pytest-runner is deprecated because it has security vulnerabilities. It's broken with the latest versions of setuptools anyway. If you remove the line 'pytest-runner' in setup.py, the package will build properly.