
Comments (12)

daler commented on July 28, 2024

Thanks for reporting this.

Setting the user file limit is an ugly, ugly solution, and I agree it's unlikely to be sufficient.

I suspect this bug has something to do with the Cython-wrapped IntervalFile not getting cleaned up properly. I tried some quick fixes but no luck yet -- I'll continue working on this and post here when it's fixed.

from pybedtools.

daler commented on July 28, 2024

Wow, nefarious bug. It seems to have been caused by stdin/stdout/stderr filehandles not getting cleaned up by the subprocess.Popen instances created every time a BEDTools command is called.

This is now fixed in 703f7e2 but only for Python 2.7 (see the commit comment at the bottom of that page for details).

For now I'm keeping this in a separate branch until I figure out another way that works for both 2.6 and 2.7. If you're stuck on 2.6, you can redirect stderr to avoid seeing the crazy number of Popen.__del__ errors.
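
For readers following along, here is a minimal sketch of the general pattern, not the actual change in 703f7e2: when every call creates a subprocess.Popen with pipes, calling communicate() and then closing any remaining pipe releases the file descriptors immediately instead of leaving that to garbage collection. The helper name run_and_release is hypothetical, and sys.executable is only a stand-in for a BEDTools binary.

```python
import subprocess
import sys


def run_and_release(args, input_bytes=None):
    """Run a command, return its stdout, and release all pipe handles."""
    p = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        # communicate() feeds stdin, reads stdout/stderr to EOF, and waits
        # for the child, so no zombie process is left behind.
        out, err = p.communicate(input_bytes)
    finally:
        # Close any pipe that is still open (skipped if already closed).
        for handle in (p.stdin, p.stdout, p.stderr):
            if handle is not None and not handle.closed:
                handle.close()
    if p.returncode != 0:
        raise RuntimeError(err.decode(errors="replace"))
    return out


# Stand-in for a BEDTools call: a child Python process that echoes stdin.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
```

Calling run_and_release(ECHO, b"...") in a long loop should keep the number of open descriptors flat, since nothing is left for a __del__ method to clean up.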

jakebiesinger commented on July 28, 2024

Hrmm. I'm getting "gcc: pybedtools/cbedtools.cpp: No such file or directory" after git clone and python setup.py install. The file isn't in the git repo; is it supposed to be generated via pyrex?

Sorry for the close...

daler commented on July 28, 2024

Yeah, if you install from pip or easy_install, the .cpp files are already generated, to avoid a dependency on Cython. But working with the development code is a little different.

First do this to get the new filehandle branch:

git clone git@github.com:daler/pybedtools.git
cd pybedtools
git fetch origin filehandle-patch:filehandle-patch
git checkout filehandle-patch

Then do this, which ought to have Cython generate the .cpp file:

python build.py

And finally you can do:

python setup.py install

Does that work?

Also . . . I think I'd rather keep this issue open, since this isn't as elegant a solution as I'd like.

jakebiesinger commented on July 28, 2024

Ah, works perfectly now. I didn't mean to close the issue-- just GitHub's "Comment & Close" button placement...

Makes sense about the Cython dependency. I haven't seen the setup.py and build.py separation before.

jakebiesinger commented on July 28, 2024

The build works fine, but I'm getting a strange OSError:

In [6]: a.randomstats(b, 100000)
<type 'exceptions.OSError'>: Cannot allocate memory
The command was:

    intersectBed -a stdin -b /home/wbiesing/src/pybedtools/pybedtools/test/data/b.bed -u

Things to check:

/home/wbiesing/src/pybedtools/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    299 
    300         print 'Things to check:'
--> 301         print '\n\t' + '\n\t'.join(problems[err.errno])
    302         raise OSError('See above for commands that gave the error')
    303 

KeyError: 12

I've also seen this show up as:

<type 'exceptions.OSError'>: Cannot allocate memory
The command was:

    shuffleBed -i /home/wbiesing/src/pybedtools/pybedtools/test/data/a.bed -g /tmp/pybedtools.JPuSgj.tmp

RAM is not the issue here: usage is very low, nowhere near the 12 GB limit on this machine. Removing your try/except in call_bedtools gives:

OSError                                   Traceback (most recent call last)

/home/wbiesing/<ipython console> in <module>()

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
   1575         distribution = self.randomintersection(other, iterations=iterations,
   1576                                                **kwargs)
-> 1577         distribution = np.array(list(distribution))
   1578 
   1579         # Median of distribution


/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug, report_iterations)
   1660                     sys.stderr.write('\r%s' % i)
   1661                     sys.stderr.flush()
-> 1662             tmp = self.shuffle(stream=True, **shuffle_kwargs)
   1663             tmp2 = tmp.intersect(other, stream=True, **intersect_kwargs)
   1664 

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
    372             # this calls the actual method in the first place; *result* is
    373             # whatever you get back
--> 374             result = method(self, *args, **kwargs)
    375 
    376             # add appropriate tags

/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
    174             # Do the actual call
    175             process, stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176                                    check_stderr=check_stderr)
    177             result = BedTool(stream)
    178             result.process = process

/home/wbiesing/src/pybedtools/pybedtools/helpers.py in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    255                                  stdout=subprocess.PIPE,
    256                                  stderr=subprocess.PIPE,
--> 257                                  bufsize=1)
    258             output = p.stdout
    259             stderr = None

/usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    670                             p2cread, p2cwrite,
    671                             c2pread, c2pwrite,
--> 672                             errread, errwrite)
    673 
    674         if mswindows:

/usr/lib/python2.7/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
   1113                     gc.disable()
   1114                     try:
-> 1115                         self.pid = os.fork()
   1116                     except:
   1117                         if gc_was_enabled:

OSError: [Errno 12] Cannot allocate memory

After that, the terminal that ran this process can't fork any more jobs:

wbiesing@cross:~$ ls /tmp/ | wc -l
bash: fork: Cannot allocate memory

That makes it look like too many processes have been forked without cleanup...?

Also, it would be nice if the temporary files created by pybedtools were cleaned up:

wbiesing@cross:~$ ls /tmp/ | wc -l
43386
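
The KeyError: 12 in the first traceback suggests that the errno-to-hints lookup in call_bedtools has no entry for ENOMEM, so the real OSError gets masked. A sketch of a more defensive lookup follows, assuming a dict shaped like the problems mapping visible in the traceback. The table entries and the name hints_for are hypothetical, not the actual pybedtools contents.

```python
import errno
import os

# Hypothetical hint table keyed by errno; the real contents may differ.
PROBLEMS = {
    errno.ENOENT: ["Is the BEDTools program installed and on your PATH?"],
    errno.EACCES: ["Is the BEDTools program executable?"],
}


def hints_for(err):
    """Return troubleshooting hints for an OSError without raising KeyError."""
    default = [
        "No specific hints for errno %s (%s)" % (err.errno, os.strerror(err.errno))
    ]
    # dict.get falls back to a generic message for unlisted error codes.
    return PROBLEMS.get(err.errno, default)


try:
    raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))
except OSError as err:
    print("Things to check:")
    print("\n\t" + "\n\t".join(hints_for(err)))
```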

daler commented on July 28, 2024

Hmm. Seems like more subprocess annoyances . . . this may take some time to debug. I'm worried pybedtools may be pushing subprocess in ways it wasn't designed for, at least for the randomintersection use-case.

As for the cleanup, things should automatically get cleaned up if everything exits normally (thanks to atexit.register(cleanup) in helpers.py). Not sure how to gracefully handle post-crash cleanup though -- I've found it useful for debugging to have the tempfiles stick around after a crash. And I'm worried that deleting tempfiles before exit could result in data loss. Any ideas?

daler commented on July 28, 2024

Still having trouble with the memory problem.

However, while debugging I found that in Python 2.6, if I use a processes value > 1 when calling randomstats(), the Popen.__del__ warning messages go away, providing an inadvertent fix. Sweet.

I'm not fluent with multiprocessing. I tried removing the if processes > 1 conditional, hoping to just run any call to randomstats() through a Pool albeit with one worker. Doing so results in my new favorite error:

AssertionError: daemonic processes are not allowed to have children

(probably good universal advice in addition to an assertion error)

Do you know of a way of letting a Pool run with a single process? If so, this could fix the py2.6 problem so these changes could be merged into the master branch.

edit: never mind about the merging . . . everything would have to be run through a Pool to get this benefit.

jakebiesinger commented on July 28, 2024

You'll get infinite recursion if you remove the conditional completely. The excellent error message is catching the Pool'ed process trying to create its own Pool (which would then create its own, etc). Better to do something like:

diff --git a/pybedtools/bedtool.py b/pybedtools/bedtool.py
index 33ec207..c2d9bf6 100644
--- a/pybedtools/bedtool.py
+++ b/pybedtools/bedtool.py
@@ -1648,7 +1648,7 @@ class BedTool(object):
             [2, 2, 2, 0, 2, 3, 2, 1, 2, 3]

         """
-        if processes > 1:
+        if processes is not None:
             p = multiprocessing.Pool(processes)
             iterations_each = [iterations / processes] * processes
             iterations_each[-1] += iterations % processes
diff --git a/pybedtools/helpers.py b/pybedtools/helpers.py
index b0f8db2..e181ff9 100644
--- a/pybedtools/helpers.py
+++ b/pybedtools/helpers.py
@@ -377,4 +377,4 @@ def _call_randomintersect(_self, other, iterations, intersect_kwargs,
                                         intersect_kwargs=intersect_kwargs,
                                         shuffle_kwargs=shuffle_kwargs,
                                         report_iterations=report_iterations,
-                                        debug=False, processes=1))
+                                        debug=False, processes=None))
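
The idea in the diff can be sketched in isolation: the parent splits the iterations across workers, and each worker is called with processes=None so it takes the serial path and never tries to create a Pool of its own. This is a simplified stand-in, not the pybedtools code; run, _worker, and split_iterations are hypothetical names, and counting stands in for the real shuffle and intersect work.

```python
import multiprocessing


def split_iterations(iterations, processes):
    """Divide iterations across workers; the last worker takes the remainder."""
    each = [iterations // processes] * processes
    each[-1] += iterations % processes
    return each


def _worker(n):
    # Workers always take the serial path (the processes=None case), so
    # they never try to create a Pool of their own.
    return run(n, processes=None)


def run(iterations, processes=None):
    """Count iterations, optionally fanning out to a Pool."""
    if processes is not None:
        pool = multiprocessing.Pool(processes)
        try:
            chunks = split_iterations(iterations, processes)
            return sum(pool.map(_worker, chunks))
        finally:
            pool.close()
            pool.join()
    return sum(1 for _ in range(iterations))  # stand-in for the real work
```

With this guard, run(10, processes=1) goes through a single-worker Pool, while run(10) stays in the calling process. Note the floor division (//), which matches what the diff's iterations / processes does on Python 2 integers.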

daler commented on July 28, 2024

Ah, right. Thanks for the info.

Since the memory issue and the general subprocess module woes are proving problematic, I'm thinking of using a naive os.system call inside randomintersection rather than relying on the pybedtools subprocess mechanism -- essentially special-casing the code that will be run hundreds of thousands of times.

daler commented on July 28, 2024

Short answer: I think all of these issues are fixed as of 1ddb49c. Can you please test?

Long answer and notes to self:
randomintersection() calls __len__, which calls count, which calls __iter__, which dispatches to the correct iterator based on the kind of BedTool it is.

For file-based BedTools, the iterator is a Cython IntervalFile. This appears to leave an open file hanging somewhere, even when the BedTool is deleted (circular references somewhere?). This still needs to be addressed, but there's a way around it: using a stream-based BedTool.

For stream-based BedTools, calling __iter__ returns a Cython IntervalIterator. The underlying BedTool.fn, though, is the stdout of the subprocess.Popen call, rather than the filename passed to IntervalFile for file-based BedTools. That means the fn can be specifically closed, freeing up the open file. So the simple fix can be seen here: 1ddb49c#L0R1693.

However, adding logic to BedTool.__del__, like:

def __del__(self):
    if isinstance(self.fn, file):
        self.fn.close()

does not fix the problem. My guess is that this is due to circular references that need to be tracked down and broken (see the object.__del__ entry at http://docs.python.org/reference/datamodel.html).

So: the issue is fixed, but it will be nice to have __del__ do all the work so that users creating hundreds of thousands of BedTools outside of the randomintersection method won't run into this issue.
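
One way to avoid depending on __del__ at all is to tie the cleanup to the iteration itself. The sketch below is an illustration of that general technique under stated assumptions, not the fix in 1ddb49c: a generator reads a subprocess's stdout and closes the pipe in a finally clause. The name iter_lines_and_close is hypothetical, and sys.executable stands in for a streaming BEDTools command.

```python
import subprocess
import sys


def iter_lines_and_close(args):
    """Yield decoded stdout lines from a command, then release its pipe."""
    p = subprocess.Popen(args, stdout=subprocess.PIPE)
    try:
        for line in p.stdout:
            yield line.decode().rstrip("\r\n")
    finally:
        # Runs when the generator is exhausted or explicitly closed, so
        # the pipe is released without relying on a __del__ method.
        p.stdout.close()
        p.wait()


# Stand-in for a streaming BEDTools command that emits two intervals.
CMD = [sys.executable, "-c", "print('chr1\\t1\\t100'); print('chr1\\t200\\t300')"]
count = sum(1 for _ in iter_lines_and_close(CMD))
```

Counting intervals this way consumes the stream fully, so the finally clause runs and the file descriptor is freed before the next iteration starts.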

jakebiesinger commented on July 28, 2024

Yeah, the fix looks good to me, even on my larger files with hundreds of thousands of iterations and several processes. Thanks for the update!
