foerstner-lab / gffpandas

Parse GFF3 into Pandas dataframes

License: ISC License

Makefile 4.85% Python 94.38% TeX 0.77%

gffpandas's People

Contributors

Stargazers

Watchers

Forkers

vivianmonzon rplanel yangzhou0916 hongqin legumeinfo mpievolbio-scicomp siddc crabron florianzwagemaker

gffpandas's Issues

gff file containing "fasta" section raises an error

pandasgff version:
Python version: 3.7
Operating System: ubuntu

Description

I have a GFF file that contains a section starting with "##FASTA", and gffpd.read_gff3 raises an error on it.
When I delete everything from "##FASTA" onward, gffpd.read_gff3 works, so this looks like an easy-to-fix issue.
Thank you for this code!

The file that works:
##gff-version 3
##source-version geneious 2020.2.2
##sequence-region pZA21RVapCmCherry 1 4048
pZA21RVapCmCherry Geneious region 1 4048 . + 0 Is_circular=true
pZA21RVapCmCherry Geneious exon 2327 3046 . - . Name=tetR
pZA21RVapCmCherry Geneious exon 3043 3837 . - . Name=KanR
pZA21RVapCmCherry Geneious exon 109 510 . + . Name=vapC
pZA21RVapCmCherry Geneious exon 494 1237 . + . Name=mCherry

The file that does not work:

`##gff-version 3
##source-version geneious 2020.2.2
##sequence-region pZA21RVapCmCherry 1 4048
pZA21RVapCmCherry Geneious region 1 4048 . + 0 Is_circular=true
pZA21RVapCmCherry Geneious exon 2327 3046 . - . Name=tetR
pZA21RVapCmCherry Geneious exon 3043 3837 . - . Name=KanR
pZA21RVapCmCherry Geneious exon 109 510 . + . Name=vapC
pZA21RVapCmCherry Geneious exon 494 1237 . + . Name=mCherry
##FASTA

>pZA21RVapCmCherry
CTCGAGTCCCTATCAGTGATAGAGATTGACATCCCTATCA
GTGATAGAGATACTGAGCACATCAGCAGGACGCACTGACC
GAATTCGACATATCCACATAAGGAGGCACTGATGCTGAAG
TTTATGCTCGATACCAACATCTGCATTTTTACGATAAAGA
ACAAACCCGCCAGTGTCAGGGAACGTTTTAACCTGAACCA
GGGGAGAATGTGCATCAGTTCGGTCACTCTGATGGAGGTG
ATATATGGTGCAGAAAAAAGCCAGATGCCTGAACGTAATC
TCGCTGTGATCGAGGGATTTGTTTCCCGCATTGACGTTCT
GGATTACGACGCTGCTGCTGCCACACACACCGGCCAGATA
AGAGCAGAACTTGCCCTTCAGGGACGCCCTGTCGGGCCAT
TTGATCAAATGATCGCAGGTCATGCCCGCAGTCGGGGACT
GATTATTGTGACTAATAACACCCGGGAATTTGAACGTGTG
GGCGGCCTGAGAATTGAAGACTGGAGTTGACCTGTTAGGA
GGTACCATGGTGAGCAAGGGCGAGGAGGATAACATGGCCA
TCATCAAGGAGTTCATGCGCTTCAAGGTGCACATGGAGGG
CTCCGTGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGC
GAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGA
AGGTGACCAAGGGTGGCCCCCTGCCCTTCGCCTGGGACAT
CCTGTCCCCTCAGTTCATGTACGGCTCCAAGGCCTACGTG
AAGCACCCCGCCGACATCCCCGACTACTTGAAGCTGTCCT
TCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGA
GGACGGCGGCGTGGTGACCGTGACCCAGGACTCCTCCCTG
CAGGACGGCGAGTTCATCTACAAGGTGAAGCTGCGCGGCA
CCAACTTCCCCTCCGACGGCCCCGTAATGCAGAAGAAGAC
CATGGGCTGGGAGGCCTCCTCCGAGCGGATGTACCCCGAG
GACGGCGCCCTGAAGGGCGAGATCAAGCAGAGGCTGAAGC
TGAAGGACGGCGGCCACTACGACGCTGAGGTCAAGACCAC
CTACAAGGCCAAGAAGCCCGTGCAGCTGCCCGGCGCCTAC
AACGTCAACATCAAGTTGGACATCACCTCCCACAACGAGG
ACTACACCATCGTGGAACAGTACGAACGCGCCGAGGGCCG
CCACTCCACCGGCGGCATGGACGAGCTGTACAAGTAAAAG
CTTAATTAGCTGAGTCTAGAGGCATCAAATAAAACGAAAG
GCTCAGTCGAAAGACTGGGCCTTTCGTTTTATCTGTTGTT
TGTCGGTGAACGCTCTCCTGAGTAGGACAAATCCGCCGCC
CTAGACCTAGGGGATATATTCCGCTTCCTCGCTCACTGAC
TCGCTACGCTCGGTCGTTCGACTGCGGCGAGCGGAAATGG
CTTACGAACGGGGCGGAGATTTCCTGGAAGATGCCAGGAA
GATACTTAACAGGGAAGTGAGAGGGCCGCGGCAAAGCCGT
TTTTCCATAGGCTCCGCCCCCCTGACAAGCATCACGAAAT
CTGACGCTCAAATCAGTGGTGGCGAAACCCGACAGGACTA
TAAAGATACCAGGCGTTTCCCCCTGGCGGCTCCCTCGTGC
GCTCTCCTGTTCCTGCCTTTCGGTTTACCGGTGTCATTCC
GCTGTTATGGCCGCGTTTGTCTCATTCCACGCCTGACACT
CAGTTCCGGGTAGGCAGTTCGCTCCAAGCTGGACTGTATG
CACGAACCCCCCGTTCAGTCCGACCGCTGCGCCTTATCCG
GTAACTATCGTCTTGAGTCCAACCCGGAAAGACATGCAAA
AGCACCACTGGCAGCAGCCACTGGTAATTGATTTAGAGGA
GTTAGTCTTGAAGTCATGCGCCGGTTAAGGCTAAACTGAA
AGGACAAGTTTTGGTGACTGCGCTCCTCCAAGCCAGTTAC
CTCGGTTCAAAGAGTTGGTAGCTCAGAGAACCTTCGAAAA
ACCGCCCTGCAAGGCGGTTTTTTCGTTTTCAGAGCAAGAG
ATTACGCGCAGACCAAAACGATCTCAAGAAGATCATCTTA
TTAATCAGATAAAATATTTCTAGATTTCAGTGCAATTTAT
CTCTTCAAATGTAGCACCTGAAGTCAGCCCCATACGATAT
AAGTTGTTACTAGTGCTTGGATTCTCACCAATAAAAAACG
CCCGGCGGCAACCGAGCGTTCTGAACAAATCCAGATGGAG
TTCTGAGGTCATTACTGGATCTATCAACAGGAGTCCAAGC
GAGCTCTAGCTCTAGGCTACTCAGCTATCTAGAAAGCTTA
AGATCCTTAAGACCCACTTTCACATTTAAGTTGTTTTTCT
AATCCGCATATGATCAATTCAAGGCCGAATAAGAAGGCTG
GCTCTGCACCTTGGTGATCAAATAATTCGATAGCTTGTCG
TAATAATGGCGGCATACTATCAGTAGTAGGTGTTTCCCTT
TCTTCTTTAGCGACTTGATGCTCTTGATCTTCCAATACGC
AACCTAAAGTAAAATGCCCCACAGCGCTGAGTGCATATAA
TGCATTCTCTAGTGAAAAACCTTGTTGGCATAAAAAGGCT
AATTGATTTTCGAGAGTTTCATACTGTTTTTCTGTAGGCC
GTGTACCTAAATGTACTTTTGCTCCATCGCGATGACTTAG
TAAAGCACATCTAAAACTTTTAGCCTTATTACGTAAAAAA
TCTTGCCAGCTTTCCCCTTCTAAAGGGCAAAAGTGAGTAT
GGTGCCTATCTAACATCTCAATGGCTAAGGCGTCGAGCAA
AGCCCGCTTATTTTTTACATGCCAATACAATGTAGGCTGC
TCTACACCTAGCTTCTGGGCGAGTTTACGGGTTGTTAAAC
CTTCGATTCCGACCTCATTAAGCAGCTCTAATGCGCTGTT
AATCACTTTACTTTTATCTAATCTAGACATATGAATTCGG
GGCGGGATTTCATGGATATGTTTCTTTCTGCGAGAACCAG
CCATATCAGTACCTCCTGAGCTCTCGAACCCCAGAGTCCC
GCTCAGAAGAACTCGTCAAGAAGGCGATAGAAGGCGATGC
GCTGCGAATCGGGAGCGGCGATACCGTAAAGCACGAGGAA
GCGGTCAGCCCATTCGCCGCCAAGCTCTTCAGCAATATCA
CGGGTAGCCAACGCTATGTCCTGATAGCGGTCCGCCACAC
CCAGCCGGCCACAGTCGATGAATCCAGAAAAGCGGCCATT
TTCCACCATGATATTCGGCAAGCAGGCATCGCCATGGGTC
ACGACGAGATCCTCGCCGTCGGGCATGCGCGCCTTGAGCC
TGGCGAACAGTTCGGCTGGCGCGAGCCCCTGATGCTCTTC
GTCCAGATCATCCTGATCGACAAGACCGGCTTCCATCCGA
GTACGTGCTCGCTCGATGCGATGTTTCGCTTGGTGGTCGA
ATGGGCAGGTAGCCGGATCAAGCGTATGCAGCCGCCGCAT
TGCATCAGCCATGATGGATACTTTCTCGGCAGGAGCAAGG
TGAGATGACAGGAGATCCTGCCCCGGCACTTCGCCCAATA
GCAGCCAGTCCCTTCCCGCTTCAGTGACAACGTCGAGCAC
AGCTGCGCAAGGAACGCCCGTCGTGGCCAGCCACGATAGC
CGCGCTGCCTCGTCCTGCAGTTCATTCAGGGCACCGGACA
GGTCGGTCTTGACAAAAAGAACCGGGCGCCCCTGCGCTGA
CAGCCGGAACACGGCGGCATCAGAGCAGCCGATTGTCTGT
TGTGCCCAGTCATAGCCGAATAGCCTCTCCACCCAAGCGG
CCGGAGAACCTGCGTGCAATCCATCTTGTTCAATCATGCG
AAACGATCCTCATCCTGTCTCTTGATCAGATCTTGATCCC
CTGCGCCATCAGATCCTTGGCGGCAAGAAAGCCATCCAGT
TTACTTTGCAGGGCTTCCCAACCTTACCAGAGGGCGCCCC
AGCTGGCAATTCCGACGTCTAAGAAACCATTATTATCATG
ACATTAACCTATAAAAATAGGCGTATCACGAGGCCCTTTC
GTCTTCAC
`

working.gff.txt
not_working.gff.txt
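A possible workaround until the parser handles this, sketched with the standard library only: copy everything before the `##FASTA` directive into a temporary file and pass that path to `gffpd.read_gff3`. The helper names `annotation_lines` and `write_annotation_only` are hypothetical and not part of gffpandas.

```python
import os
import tempfile


def annotation_lines(lines):
    """Yield GFF3 lines up to, but not including, the ##FASTA directive."""
    for line in lines:
        if line.strip() == "##FASTA":
            break
        yield line


def write_annotation_only(gff_path):
    """Copy the annotation part of gff_path to a temp file and return its path.

    Hypothetical helper, not part of gffpandas. The caller deletes the file.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".gff3")
    with open(gff_path, encoding="utf-8") as src, \
            os.fdopen(fd, "w", encoding="utf-8") as dst:
        dst.writelines(annotation_lines(src))
    return tmp_path


# Usage (assumes gffpandas is installed):
# import gffpandas.gffpandas as gffpd
# annotation = gffpd.read_gff3(write_annotation_only("not_working.gff.txt"))
```

The sequence data is dropped in this sketch; if it is needed, it would have to be read separately from the lines after `##FASTA`.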


change IDs in a GFF3

Hi,
Are there any examples of how to change IDs in a GFF3 file and write the result back to a file?

Thank you in advance,

Michal
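One way this might be approached, as a sketch rather than an official recipe: rewrite the `ID` and `Parent` values in each attributes string, then write the data frame back out. The function `rename_ids` and the mapping `id_map` are hypothetical names; the commented usage assumes the `df` attribute and a `to_gff3` method on the object returned by `read_gff3`, which should be checked against the gffpandas docs.

```python
def rename_ids(attributes, id_map):
    """Return the attributes string with ID and Parent values renamed via id_map."""
    renamed = []
    for field in attributes.split(";"):
        key, sep, value = field.partition("=")
        if key == "ID":
            value = id_map.get(value, value)
        elif key == "Parent":
            # Parent may hold several comma-separated IDs.
            value = ",".join(id_map.get(v, v) for v in value.split(","))
        renamed.append(key + sep + value)
    return ";".join(renamed)


print(rename_ids("ID=mRNA1;Parent=gene1", {"gene1": "geneA", "mRNA1": "mRNAa"}))
# ID=mRNAa;Parent=geneA

# Usage with gffpandas (assumed API, see lead-in):
# import gffpandas.gffpandas as gffpd
# annotation = gffpd.read_gff3("in.gff3")
# annotation.df["attributes"] = annotation.df["attributes"].apply(
#     rename_ids, id_map={"gene1": "geneA"})
# annotation.to_gff3("out.gff3")
```

Renaming `Parent` together with `ID` keeps child features attached to the renamed parent.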

Parse column attributes

pandasgff version: 1.2.0
Python version: 3.7.3
Operating System: Gnu/Linux

Description

Parse the GFF file GCF_000005845.2_ASM584v2_genomic.gff.gz.

attributes_to_columns() cannot parse the attributes column because one value contains an extra "=".

The line that makes gffpandas crash is:

NC_000913.3	RefSeq	CDS	257829	257899	.	+	0	ID=cds-gnl|b0240|CDS=288;Parent=gene-b0240;Dbxref=ASAP:ABE-0000821,ECOCYC:EG11092,EcoGene:EG11092,GeneID:945077;gbkey=CDS;gene=crl;locus_tag=b0240;orig_protein_id=gnl|b0240|CDS%3D288;orig_transcript_id=gnl|b0240|mrna.CDS%3D288;product=RNA polymerase holoenzyme assembly factor Crl;transl_table=11

More precisely, the problem is the part ID=cds-gnl|b0240|CDS=288; which contains an extra "=".

What I Did

import gffpandas.gffpandas as gffpd
annotation = gffpd.read_gff3(genome_path)
annotation = annotation.attributes_to_columns()

This can be fixed by replacing key_value_pair.split('=') with key_value_pair.split('=', 1) in the function attributes_to_columns.

However, I'm not sure it is a good solution :)

I can create a PR for that if you agree.
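To illustrate the proposed change outside of gffpandas, here is a minimal stand-alone sketch. `parse_attributes` is a hypothetical name, not the library's function; it only shows how `split('=', 1)` keeps everything after the first "=" as the value.

```python
def parse_attributes(attributes):
    """Split a GFF3 attributes string into a dict, splitting each pair once."""
    pairs = (field.split("=", 1) for field in attributes.split(";") if "=" in field)
    return {key: value for key, value in pairs}


attrs = parse_attributes("ID=cds-gnl|b0240|CDS=288;Parent=gene-b0240;gbkey=CDS")
print(attrs["ID"])      # cds-gnl|b0240|CDS=288
print(attrs["Parent"])  # gene-b0240
```

The GFF3 specification reserves "=" inside values and expects it to be written as %3D, so the input line is arguably malformed, but splitting only once makes the parser tolerant of it.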

Make gffpandas available on bioconda

This would increase visibility and usability. I want to use gffpandas as part of a large Snakemake workflow, and the workflow is harder to redistribute if gffpandas is not available on Bioconda.

Feature request: allow reading from buffer

I'm enabling reading of GFF files from a URL with decompression. Pandas can do this, and can also read from a buffer-like object, but _read_gff_header explicitly opens a file, thus requiring explicit downloads to a local file. I believe that files also get opened twice, once in the header read and a second time by pandas.

Please allow input of buffer objects.
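A rough sketch of what accepting both cases could look like for the header step. `read_header` here is a hypothetical stand-alone function, not the existing `_read_gff_header` method, and it assumes a text-mode buffer.

```python
import contextlib
import io


def read_header(source):
    """Collect the lines starting with '#' from a path or an open text buffer."""
    if hasattr(source, "read"):
        # Already file-like: use it as is and leave closing to the caller.
        manager = contextlib.nullcontext(source)
    else:
        manager = open(source, encoding="utf-8")
    with manager as handle:
        return "".join(line for line in handle if line.startswith("#"))


buffer = io.StringIO("##gff-version 3\nchr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n")
print(read_header(buffer), end="")  # ##gff-version 3
```

Reading the header consumes the buffer, so a caller that also wants pandas to parse the same buffer would need to call `buffer.seek(0)` first, or collect header and body in a single pass.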

adding gene description

Hi all,
The script below parses BLAST XML output and reads a GFF3 file with gffpandas. How can I use gffpandas to add 'Note=Gene description' to an mRNA if its ID is in the BLAST output?

#!/usr/bin/python3
import click
from Bio.Blast import NCBIXML
import gffpandas.gffpandas as gffpd

class Hits():
    def __init__(self, hit_id, hit_def):
        self.hit_id = hit_id
        self.hit_def = hit_def

def retrieve_hits_data(blast_XML_file):
    with open(blast_XML_file) as bf:
        blast_records = NCBIXML.parse(bf)
        hits_all = {}

        for blast_record in blast_records:
            query_name = blast_record.query
            try:
                hit_id = blast_record.alignments[0].hit_id.split('|')[1]
                hit_def = blast_record.alignments[0].hit_def.split("OS=")[0]
                hits_all[query_name] = Hits(hit_id,hit_def)
            except IndexError:
                continue
    return hits_all

@click.command()
@click.option('--gff3', help="Provide GFF3 file", required=True)
@click.option('--keep', help="Keep GFF3 file", required=True)
@click.option('--reject', help="Reject GFF3 file", required=True)
@click.option('--xml', help="Blast XML file", required=True)
def run(gff3, keep, reject, xml):
    blast_hits = retrieve_hits_data(xml)
    # Read the file passed via --gff3 rather than a hard-coded path.
    annotation = gffpd.read_gff3(gff3)


if __name__ == '__main__':
    run()

Thank you in advance,

Michal
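A possible approach, sketched under the assumption that the BLAST query names equal the mRNA `ID` values in the GFF3: append a `Note` attribute to each attributes string whose ID has a hit. `add_note` and the `descriptions` dict (ID to description text) are hypothetical names; reserved characters in the description are percent-encoded, as GFF3 requires for attribute values.

```python
from urllib.parse import quote


def add_note(attributes, descriptions):
    """Append Note=<description> if the feature's ID is a key of descriptions."""
    fields = [field for field in attributes.split(";") if field]
    feature_id = None
    for field in fields:
        key, _, value = field.partition("=")
        if key == "ID":
            feature_id = value
    if feature_id in descriptions:
        # Escape ';', '=', ',' and similar characters, but keep spaces readable.
        note = quote(descriptions[feature_id].strip(), safe=" ")
        fields.append("Note=" + note)
    return ";".join(fields)


print(add_note("ID=mRNA1;Parent=gene1", {"mRNA1": "Kinase, putative "}))
# ID=mRNA1;Parent=gene1;Note=Kinase%2C putative

# Possible use inside run(), restricted to mRNA rows (the df attribute is assumed):
# descriptions = {name: hit.hit_def for name, hit in blast_hits.items()}
# is_mrna = annotation.df["type"] == "mRNA"
# annotation.df.loc[is_mrna, "attributes"] = (
#     annotation.df.loc[is_mrna, "attributes"].apply(add_note, descriptions=descriptions))
```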

it does not work for gz file

pandasgff version:
Python version: 3.8.8
Operating System: centos

Description

I would like gffpandas to read gz files directly, like pandas and gtfparser do:

import gffpandas.gffpandas as gffpd
annotation = gffpd.read_gff3('gencode.v32.chr_patch_hapl_scaff.annotation.gff3.gz')

I also cannot find out which version is installed. It is not in your docs; I think it should be exposed as a magic variable (__version__):

In [6]: gffpd.__version__
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 1
----> 1 gffpd.__version__

AttributeError: module 'gffpandas.gffpandas' has no attribute '__version__'
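Until a `__version__` attribute exists, the installed version can usually be read from the package metadata with the standard library (Python 3.8+). This is a generic sketch; `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the name "gffpandas".

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("gffpandas"))
```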

What I Did

It raises an exception when I read a gz file:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[3], line 2
      1 import gffpandas.gffpandas as gffpd
----> 2 annotation = gffpd.read_gff3('gencode.v32.chr_patch_hapl_scaff.annotation.gff3.gz')

File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:6, in read_gff3(input_file)
      5 def read_gff3(input_file):
----> 6     return Gff3DataFrame(input_file)

File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:22, in Gff3DataFrame.__init__(self, input_gff_file, input_df, input_header)
     20     self._gff_file = input_gff_file
     21     self._read_gff3_to_df()
---> 22     self._read_gff_header()
     23 else:
     24     self.df = input_df

File ~/anaconda3/lib/python3.8/site-packages/gffpandas/gffpandas.py:44, in Gff3DataFrame._read_gff_header(self)
     39 """Create a header.
     40 
     41 The header of the gff file is read, means all lines,
     42 which start with '#'."""
     43 self.header = ''
---> 44 for line in open(self._gff_file):
     45     if line.startswith('#'):
     46         self.header += line

File ~/anaconda3/lib/python3.8/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
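The byte 0x8b at position 1 in the error is the second byte of the gzip magic number (0x1f 0x8b), so the header reader is decoding compressed data as UTF-8. A minimal sketch of one way around it, using only the standard library; `open_text` and `read_gff_header` are hypothetical names, not gffpandas functions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")


def read_gff_header(path):
    """Return all lines starting with '#', whether or not the file is gzipped."""
    with open_text(path) as handle:
        return "".join(line for line in handle if line.startswith("#"))
```

Checking the magic bytes instead of the file extension also covers compressed files that do not end in `.gz`.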

no indication of problem lines in badly-formed GFFs

Running gffpandas over a badly formed GFF doesn't give any indication of where the problem occurred. Line numbers in ValueError messages would be invaluable. To accomplish this, you'll have to refactor a bit. Consider adding a line-number column to the data frame for temporary internal-only use and doing conversions in df.iterrows() or df.groupby() using try/except rather than in apply(). An exception raised inside apply() carries no information about which row caused it, which is fine for some situations but not this one.
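To make the idea concrete, here is a small sketch that works on raw lines rather than on a data frame, so it is not a drop-in change for gffpandas. `parse_gff_lines` is a hypothetical function that re-raises conversion errors with the offending line number attached.

```python
def parse_gff_lines(lines):
    """Yield (line_number, fields) for feature lines; errors name the line."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(
                f"line {number}: expected 9 tab-separated columns, found {len(fields)}")
        try:
            fields[3], fields[4] = int(fields[3]), int(fields[4])
        except ValueError as err:
            raise ValueError(
                f"line {number}: start/end must be integers ({err})") from err
        yield number, fields
```

The same pattern (catch, add the line number, re-raise) should carry over to a row-wise loop over a data frame that keeps a line-number column.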

attributes_to_columns() raises ValueError on some valid GFFs

pandasgff version: 1.2.0
Python version: 3.8
Operating System: linux

Description

I got the following failure when I read this GFF file:

File "gffpandas/gffpandas.py", line 133, in
lambda attributes: dict([key_value_pair.split('=') for
ValueError: dictionary update sequence element #6 has length 1; 2 is required

This lambda is one of three in gffpandas. In my opinion, all 3 should be
refactored into map() or stand-alone functions. Lambdas are hard to read
and even harder to make defensive.

This failure is because a handful of features in this file end with a ";". Yes, this is a somewhat rare thing, but I also see it on some NCBI gffs. Like many ugly things in GFF-land, I don't know if it's strictly valid, but plenty of GFFs that pass GFF validators (e.g. the one in genometools) have them. There's an additional (even more rare) problem that somebody can put an equal sign into the feature string.
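The failure can be reproduced without gffpandas. The snippet below mimics the splitting expression from the traceback (`naive_split` is a name made up for this illustration) and shows that a trailing ";" produces an empty last piece, which `dict()` rejects.

```python
def naive_split(attributes):
    """Mimic the failing expression: each ';'-separated piece is split on every '='."""
    return dict([pair.split("=") for pair in attributes.split(";")])


print(naive_split("ID=gene1;Name=abc"))  # {'ID': 'gene1', 'Name': 'abc'}

try:
    naive_split("ID=gene1;Name=abc;")    # trailing ';' yields an empty last piece
except ValueError as err:
    print("ValueError:", err)
```

A value that itself contains "=" fails the same way, because the piece then splits into three parts instead of two.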

Here's a function that can be substituted for the lambda expression that fixes the problem:

def _split_atts(atts):
    """Split a feature string into attributes."""
    splits_list = [a.split("=") for a in atts.split(";") if "=" in a]
    return {l[0]: "=".join(l[1:]) for l in splits_list}

error in documentation

pandasgff version:
Python version:
Operating System:

Description

In https://gffpandas.readthedocs.io/en/latest/background.html the link 'How to use gffpandas' points to a local file: file:///home/vivian/gffPandas/gffpandas/docs/build/html/tutorial.html

close all files

It's good practice to run tests with warnings turned on, which one accomplishes via
warnings.filterwarnings("error").

When I do this with my code, I get a complaint that gffpandas doesn't close the gff file.
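For reference, a sketch of the usual remedy and of a way to check for it. Reading inside a `with` block closes the file deterministically, and recording warnings around the call shows whether a ResourceWarning was emitted. Both function names are hypothetical and not taken from gffpandas.

```python
import gc
import warnings


def header_with_context_manager(path):
    """Read the '#' lines; the with-block closes the file when it exits."""
    with open(path, encoding="utf-8") as handle:
        return "".join(line for line in handle if line.startswith("#"))


def resource_warnings_from(func, *args):
    """Call func(*args) and return any ResourceWarning records it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ResourceWarning)
        func(*args)
        gc.collect()  # give unclosed file objects a chance to be finalized
    return [w for w in caught if issubclass(w.category, ResourceWarning)]
```

With the context-manager version, `resource_warnings_from(header_with_context_manager, path)` should return an empty list.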

remove pytest-runner dependency

gffpandas has pytest-runner as a setup_requires hard dependency, meaning that setup will fail without pytest-runner installed, even though pytest-runner is only needed for testing. gffpandas shares this unusual feature with code like screed, which points to a common ancestor.

pytest-runner is deprecated because it has security vulnerabilities. It's broken with the latest versions of setuptools anyway. If you remove the line 'pytest-runner' in setup.py, the package will build properly.
