My issue: I'm reading in a GTF file with exons. I want to collect the exons for a

start site off-by-one errors when creating BedTool from list of intervals about pybedtools HOT 6 CLOSED

daler commented on July 28, 2024

start site off-by-one errors when creating BedTool from list of intervals

from pybedtools.

Comments (6)

daler commented on July 28, 2024

Hi Wolfgang, thanks for reporting this. Which version are you using? I pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq exon    137924248   137924468   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq exon    137925209   137925388   .   -   .   gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

chr5    137924247   137924468
chr5    137925208   137925388

137924247
137924247
137924247

from pybedtools.

wresch commented on July 28, 2024

Hi Ryan,

import pybedtools
pybedtools.version
pybedtools.version
'0.5.5'

mergeBed -h

Program: mergeBed (v2.12.0)
Author: Aaron Quinlan ([email protected])
Summary: Merges overlapping BED/GFF/VCF entries into a single interval.

output from your short script:
chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924248 137924468
chr5 137925209 137925388

137924247
137924248
137924248

Odd.

Thanks,
Wolf

On Tue, Feb 28, 2012 at 4:33 PM, Ryan Dale <
[email protected]

wrote:

Hi Wolfgang, thanks for reporting this. Which version are you using? I
pasted in your exact commands in doctest_mode and can't reproduce this:

>>> In [1]: import pybedtools
>>> In [2]: gtf = pybedtools.BedTool("test.gtf")
>>> In [4]: gtf_intervals = list(gtf)
>>> In [5]: gtf_intervals
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
>>> new_bt
[Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
>>> In [26]: new_bt_merged = new_bt.merge()
>>> In [27]: new_bt_merged
<BedTool(/tmp/pybedtools.kHuwlf.tmp)>
>>> In [28]: new_bt_merged[0]
Interval(chr5:137924247-137924468)
>>> In [29]: new_bt_merged[0].start
137924247L

Can you try this short script to see what results you get?

import pybedtools
x = pybedtools.BedTool('test.gtf')
xm = x.merge()
iter_xm = pybedtools.BedTool(i for i in x).merge()
print x
print xm
print x[0].start
print xm[0].start
print iter_xm[0].start

i get the following:

chr5    ucsc_refseq     exon    137924248       137924468       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5    ucsc_refseq     exon    137925209       137925388       .       -
      .       gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5    137924247       137924468
chr5    137925208       137925388

137924247
137924247
137924247

Reply to this email directly or view it on GitHub:
#53 (comment)

from pybedtools.

daler commented on July 28, 2024

Can you pull the latest pybedtools from GitHub? I actually haven't changed the version number since 0.5.5 despite many changes elsewhere in the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working on my end, using the github versions of pybedtools and BEDTools, so it's likely this issue has been fixed -- just not in released versions yet.

from pybedtools.

wresch commented on July 28, 2024

Actually just upgrading to bedtools 2.15 solved it. I had not realized i
was behind the curve by that much...

Thanks for the very quick reply!

Wolf

chr5 ucsc_refseq exon 137924248 137924468 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . -
. gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942";
tss_id "tss_07393"; exon_number 4

chr5 137924247 137924468
chr5 137925208 137925388

137924247
137924247
137924247

On Tue, Feb 28, 2012 at 5:03 PM, Ryan Dale <
[email protected]

wrote:

Can you pull the latest pybedtools from GitHub? I actually haven't
changed the version number since 0.5.5 despite many changes elsewhere in
the code base (though I'm about to release 0.6).

You also might want to upgrade BEDTools to 2.15 if possible. It's working
on my end, using the github versions of pybedtools and BEDTools, so it's
likely this issue has been fixed -- just not in released versions yet.

Reply to this email directly or view it on GitHub:
#53 (comment)

from pybedtools.

daler commented on July 28, 2024

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method will probably come in handy for what you're doing . . . just tested this on a mouse GTF file. It should run a lot faster if you can assume the input file is already sorted by gene name.

import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
        .filter(lambda x: x[2]=='exon')\
        .saveas()

def key(x):
    return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
    print gene,
    print pybedtools.BedTool(exons).sort().total_coverage()

from pybedtools.

wresch commented on July 28, 2024

That is indeed better than what i was doing.

On Tue, Feb 28, 2012 at 5:14 PM, Ryan Dale <
[email protected]

wrote:

Good to hear.

By the way, itertools.groupBy and the BedTool.total_coverage method
will probably come in handy for what you're doing . . . just tested this on
a mouse GTF file. It should run a lot faster if you can assume the input
file is already sorted by gene name.
import pybedtools
import itertools

ex = pybedtools.BedTool('mm9.gtf.chr18')\
       .filter(lambda x: x[2]=='exon')\
       .saveas()

def key(x):
   return x['gene_name']

exons_by_gene = sorted(ex, key=key)
for gene, exons in itertools.groupby(exons_by_gene, key=key):
   print gene,
   print pybedtools.BedTool(exons).sort().total_coverage()
Reply to this email directly or view it on GitHub:
#53 (comment)

from pybedtools.

start site off-by-one errors when creating BedTool from list of intervals about pybedtools HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent