
Comments (16)

brentp avatar brentp commented on July 28, 2024

You can fix this issue with:

diff --git a/pybedtools/helpers.py b/pybedtools/helpers.py
index 7733c15..7af0e00 100644
--- a/pybedtools/helpers.py
+++ b/pybedtools/helpers.py
@@ -301,7 +301,10 @@ def call_bedtools(cmds, tmpfn=None, stdin=None, check_stderr=None):
         print '\n\t' + '\n\t'.join(problems[err.errno])
         raise OSError('See above for commands that gave the error')

-    return output
+    try:
+        return output
+    finally:
+        output.close()


 def set_bedtools_path(path=""):

but I'm not sure if that's supported in 2.6 and I don't think it's in 2.5.
Also, note that there's currently no __del__ method on a BedTool, though
I tried adding one and it doesn't get called until the loop finishes.
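For reference, the ordering in that try/finally is subtle: the finally clause runs after the return value has already been evaluated, so the caller receives a file object whose descriptor is released. A minimal standalone sketch (not pybedtools code) showing why this only suits file-backed, not streaming, use:

```python
import os
import tempfile

def open_then_close(path):
    f = open(path)
    try:
        return f
    finally:
        # runs after the return value has been evaluated, so the caller
        # receives a file object whose descriptor is already released
        f.close()

# write a throwaway one-line BED file to demonstrate with
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".bed", delete=False)
tmp.write("chr1\t1\t100\n")
tmp.close()

f = open_then_close(tmp.name)
print(f.closed)  # True: the descriptor was freed before the caller saw it
os.unlink(tmp.name)
```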

from pybedtools.

daler avatar daler commented on July 28, 2024

disregard that last commit-comment . . .

tried the following, but it tries to close BedTools that are string- or iterator-based:

try:
    return output
finally:
    output.close()

then tried this, because really what we want to close are the stdout file-like objects of Popen objects:

try:
    if hasattr(output, 'close'):
        output.close()
finally:
    return output

but tests fail on line 226 of helpers (https://github.com/daler/pybedtools/blob/master/pybedtools/helpers.py#L226), for tests that try to read a Popen.stdout from a "streamed" BedTool.

So the problem is that we don't want to always close a Popen.stdout, because it may be used later to chain together iterable BedTools. That is, the iterator can be set up during this call to call_bedtools, but it may not actually get used for an undefined amount of time. Maybe there's a way to detect if it's been consumed or not?
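The lazy-consumption problem can be reproduced outside pybedtools. A small standalone sketch (the helper name stream_lines is made up for illustration): the pipe's cleanup only runs once the generator is actually exhausted, so merely creating the iterator, as a chained streamed BedTool does, leaves the descriptor open indefinitely:

```python
import subprocess
import sys

def stream_lines(cmd):
    """Yield lines from a subprocess, closing its stdout only when exhausted."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        for line in p.stdout:
            yield line
    finally:
        # runs only when the generator is exhausted, closed, or collected
        p.stdout.close()
        p.wait()

# Creating the generator does NOT run any of the body yet; the pipe stays
# open until someone actually consumes it.
gen = stream_lines([sys.executable, "-c", "print('a'); print('b')"])
lines = list(gen)  # exhausting the generator triggers the cleanup
print(lines)
```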


brentp avatar brentp commented on July 28, 2024

So the order in your second block is reversed. I would guess:

try:
    return output
finally:
    if hasattr(output, "communicate"):
        output.close()


daler avatar daler commented on July 28, 2024

Oops, right. The typo was in the comment; the correct code still has the line 226 problem.

I think the issue is in setting up the Popen object in the first place. The big try-except block in call_bedtools catches the OSError when there are too many files open. The problem though is that this call to call_bedtools should not have its stdout closed . . . really we want to close the first stdout ever opened in this session of pybedtools that is still open. Jeez, does that mean we need to maintain a stack of still-open Popen.stdouts?
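A stack of still-open streams could be sketched roughly like this (all names here are hypothetical, not pybedtools API): register each stream as it is opened, and close the oldest one still alive when the descriptor limit looms:

```python
import io

# Hypothetical module-level registry of still-open streams, oldest first.
_open_streams = []

def register_stream(stream):
    """Record a newly opened file-like stream and return it unchanged."""
    _open_streams.append(stream)
    return stream

def close_oldest():
    """Close the earliest-registered stream that is still open.

    Returns True if one was closed, False if none remained."""
    while _open_streams:
        stream = _open_streams.pop(0)
        if not stream.closed:
            stream.close()
            return True
    return False

# Demonstration with in-memory streams standing in for Popen.stdout objects:
a = register_stream(io.StringIO("first"))
b = register_stream(io.StringIO("second"))
close_oldest()
print(a.closed, b.closed)  # True False
```

A real implementation would likely hold weak references so the registry itself does not keep dead BedTools alive, but the bookkeeping shape is the same.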


brentp avatar brentp commented on July 28, 2024

I tried a bunch of stuff for this. I thought the most promising was:

def stream_iter(p):
    for s in p.stdout:
        yield s
    p.stdout.close()

output = stream_iter(p)

which could work but then it has problems with some of the assumptions later on.
And, since we never actually iterate over the stream in the above test-case, close will still not be called.

So, one could then make a class with a __del__ method that closes the stream...

Note that the BedTool class currently has no __del__ method.
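A rough sketch of such a wrapper class (illustrative only, not actual pybedtools code): it closes the underlying stream on exhaustion, on an explicit close(), or when the wrapper is garbage-collected — covering the case above where the stream is never iterated at all:

```python
import io

class ClosingStream:
    """Wrap a file-like stream; close it on exhaustion, close(), or GC."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.stream)
        except StopIteration:
            self.close()  # exhausted: release the descriptor immediately
            raise

    def close(self):
        if not self.stream.closed:
            self.stream.close()

    def __del__(self):
        # runs when the wrapper is collected, even if never iterated
        self.close()

# In-memory stand-in for a Popen.stdout:
s = ClosingStream(io.StringIO("chr1\t1\t100\n"))
lines = list(s)          # exhausting the iterator closes the stream
print(s.stream.closed)   # True
```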


brentp avatar brentp commented on July 28, 2024

if the above test-case is really a problem, one can explicitly do c.fn.close()
as a stop-gap.


daler avatar daler commented on July 28, 2024

yeah, the bug initially came up with all the random intersections, and I had to use that fn.close() thing in randomintersection:

https://github.com/daler/pybedtools/blob/master/pybedtools/bedtool.py#L1709

but that's inelegant, and it requires a user who is doing tons of iterations on their own outside of the randomintersection method (like the test case above) to know they have to fn.close() their BedTools. I played around with __del__ methods too, to no avail . . . hence its continued absence. I'll have to think about your stream_iter idea though.


jakebiesinger avatar jakebiesinger commented on July 28, 2024

While iterating over some bed files, I call field_count() to get a particular column. Python crashes without a traceback after precisely 1008 field_count() calls on my machine.

Try the following with a 2000-line bed file:

a = pybedtools.BedTool('test.bed')
for i, b in enumerate(a):
    print i
    x = a.field_count()

and Python will choke with "Error: The requested bed file (test.bed) could not be opened. Exiting!" somewhere around 1,000 iterations. I can easily work around this by keeping the field_count around, but it seems like the same fishy issue as randomintersect (hence posting here instead of opening a new issue).

I think the exit on line https://github.com/daler/pybedtools/blob/master/src/bedFile.cpp#L41 is being called, and I think the reason is there are too many files open. Perhaps the __iter__ at https://github.com/daler/pybedtools/blob/master/pybedtools/bedtool.py#L618 is not being cleaned up properly? Unlike randomintersection, I can't explicitly fn.close() the bedtool since I'm not done iterating over it.


daler avatar daler commented on July 28, 2024

Yeah, this is a tough bug and I don't have a good solution yet. Iterators don't get cleaned up properly, and deleting a BedTool doesn't get rid of the open file reference.

For the specific case of checking how many fields you're working with . . . it depends on what you're using it for, but checking the length of an Interval directly will be way more efficient, and works fine on large files:

a = pybedtools.BedTool('test.bed')
for i, b in enumerate(a):
    print i
    x = len(b.fields)


daler avatar daler commented on July 28, 2024

as of cdb789c this appears to be working. That is, relevant tests pass, and the previous test-case above works . . . specifically, this used to fail with an OSError because of too many open files, but now works:

import pybedtools
import sys
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
for i in range(10000):
    sys.stderr.write('\r%s' % i)
    sys.stderr.flush()
    c = a.intersect(b, stream=True)
    len(c)

The fix is just on the iterator side, specifically in IntervalIterator.__next__. See relevant code here. When the stream is being read from subprocess.PIPE, the pipe is closed when it's done reading. An open file will also be closed upon hitting StopIteration; I don't think this latter case will cause any problems, unless the user wants to seek and read inside the file or something.

The part about this fix that I don't fully understand: if you trap only the StopIteration in IntervalIterator.__next__, the caller still tries to call next() on the stream, even though a StopIteration was re-raised immediately after closing. This results in a ValueError because next() was called on a closed file object. Trapping this ValueError as well as the StopIteration seems to have fixed the original issue. But I don't know why the first StopIteration isn't being propagated up to the IntervalIterator caller. I couldn't find any documentation on __next__ methods in Cython that discussed this, either.
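A plain-Python rendering of the behavior described (the real IntervalIterator is Cython, so this sketch only approximates it): close the stream on exhaustion, and also trap the follow-up ValueError raised when next() is called again on the now-closed file:

```python
import io

class IntervalIteratorSketch:
    """Approximation of the described fix, not the actual Cython code."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.stream)
        except (StopIteration, ValueError):
            # ValueError ("I/O operation on closed file") occurs if a
            # caller calls next() again after we closed the stream below.
            if not self.stream.closed:
                self.stream.close()
            raise StopIteration

# In-memory stand-in for a streamed BedTool:
it = IntervalIteratorSketch(io.StringIO("chr1\t1\t100\n"))
rows = list(it)
print(rows, it.stream.closed)
```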

It appears to work well for me though. More testing is probably needed before closing the issue.


daler avatar daler commented on July 28, 2024

Has been working well for a month now, so closing.


mlovci avatar mlovci commented on July 28, 2024

I'm getting a similar error with the current version of pybedtools; can you help?

<type 'exceptions.OSError'>: Too many open files
The command was:

multiBamCov -bed di.bed -bams 1.all.bam 2.all.bam

Things to check:

* Too many files open -- please submit a bug report so that this can be fixed


daler avatar daler commented on July 28, 2024

Could you provide some code that reproduces the problem?


mlovci avatar mlovci commented on July 28, 2024

I'm struggling to get this to display correctly, maybe because it's just pseudocode, but I hope you get the idea. It should be indented from the "for" to the line with the two "##", but I must not know the proper syntax.
There is lots of code, so I'll try to keep it concise:

(@daler edit: you can use three backticks to get nice syntax-highlighted code blocks)

f = open("file with exon coordinates")
exp1 = "bam filename for experiment1"
exp2 = "bam filename for experiment2"
countList = list()

for line in f.readlines():
    upIntron_coords, downIntron_coords = lineparse(line) #parse upstream and downstream intron coordinates
    upIntron = pybedtools.BedTool(upIntron_coords, from_string=True).saveas("uI.bed")
    downIntron = pybedtools.BedTool(downIntron_coords, from_string=True).saveas("dI.bed")
    count_uI = upIntron.multi_bam_coverage(bams=[exp1, exp2], D=True).saveas("countfile")
    del upIntron
    counts_up = parse(count_uI)
    del count_uI
    count_dI = downIntron.multi_bam_coverage(bams=[exp1, exp2], D=True).saveas("countfile")
    del downIntron
    counts_down = parse(count_dI)
    del count_dI
    countList.append([counts_up, counts_down])
    ##gets here about 50 times then raises the exception that there are too many files open
exit()


daler avatar daler commented on July 28, 2024

Ah, so you're creating two new BedTool objects for every line. Neither I nor any collaborators have been able to successfully free the open file in the BedTool.__del__ method, which means that calling del does not close the file.

As a result, you will quickly reach your OS's open file limit.
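On Unix you can inspect the limit the loop is running into (values vary by system; on Linux, /proc/self/fd also lists the descriptors currently held by the process):

```python
import os
import resource

# Unix-only: per-process open-file limits (soft limit is what the loop hits).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# On Linux, each entry in /proc/self/fd is one descriptor this process holds.
if os.path.isdir("/proc/self/fd"):
    print("open fds:", len(os.listdir("/proc/self/fd")))
```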

I would recommend creating two complete BED files first out of your input file, something like:

def split_input(fn):
    fout1 = open(fn + '.up', 'w')
    fout2 = open(fn + '.down', 'w')
    for line in open(fn):
        up, down = lineparse(line)
        fout1.write(up)
        fout2.write(down)
    fout1.close()
    fout2.close()

Then, run multi_bam_coverage on each file, and then parse the results.

bt_up = pybedtools.BedTool(fn + '.up')
bt_dn = pybedtools.BedTool(fn + '.down')
count_uI = bt_up.multi_bam_coverage(bams=[exp1, exp2], D=True).saveas('uI.bed')
count_dI = bt_dn.multi_bam_coverage(bams=[exp1, exp2], D=True).saveas('dI.bed')

# then parse count_uI and count_dI

This way, you only ever create 4 BedTool objects. This should run much faster, too.

See #55 for more info about this, for a user with a similar problem. Time for me to write some docs on this, I think.


mlovci avatar mlovci commented on July 28, 2024

Thanks. I'll try to work that in...

