Comments (12)
Thanks for reporting this.
Setting the user file limit is an ugly, ugly solution, and I agree it's unlikely to be sufficient.
I suspect this bug has something with the Cython-wrapped IntervalFile
not getting cleaned up properly. I tried some quick fixes but no luck yet -- I'll continue working on this and post here when it's fixed.
from pybedtools.
Wow, nefarious bug. It seems to have been cause by stdin/stdout/stderr filehandles not getting cleaned up by the subprocess.Popen instances created every time a BEDTools command is called.
This is now fixed in 703f7e2 but only for Python 2.7 (see the commit comment at the bottom of that page for details).
For now I'm keeping this in a separate branch until I figure out another way that works for both 2.6 and 2.7. If you're stuck on 2.6, you can redirect stderr to avoid seeing the crazy number of Popen.__del__
errors.
from pybedtools.
Hrmm. I'm getting a gcc: pybedtools/cbedtools.cpp: No such file or directory
after git clone and python setup.py install
. The file isn't in the git repo; is it supposed to be generated via pyrex?
Sorry for the close...
from pybedtools.
Yeah, if you install from pip
or easy_install
, the .cpp files are already compiled to avoid a dependency on Cython. But working with the development code is a little different.
First do this to get the new filehandle branch:
git clone [email protected]:daler/pybedtools.git
git fetch origin filehandle-patch:filehandle-patch
git checkout filehandle-patch
Then do this, which ought to have Cython generate the .cpp file:
python build.py
And finally you can do:
python setup.py install
Does that work?
Also . . . I think I'd rather keep this issue open, since it isn't as elegant as a solution as I'd like.
from pybedtools.
Ah, works perfectly now. I didn't mean to close the issue-- just GitHub's "Comment & Close" button placement...
Makes sense about the Cython dependency. I haven't seen the setup.py and build.py separation before.
from pybedtools.
The build works fine, but I'm getting a strange OSError
:
In [6]: a.randomstats(b, 100000)
<type 'exceptions.OSError'>: Cannot allocate memory
The command was:
intersectBed -a stdin -b /home/wbiesing/src/pybedtools/pybedtools/test/data/b.bed -u
Things to check:
/home/wbiesing/src/pybedtools/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
299
300 print 'Things to check:'
--> 301 print '\n\t' + '\n\t'.join(problems[err.errno])
302 raise OSError('See above for commands that gave the error')
303
KeyError: 12
I've also seen this show up as:
<type 'exceptions.OSError'>: Cannot allocate memory
The command was:
shuffleBed -i /home/wbiesing/src/pybedtools/pybedtools/test/data/a.bed -g /tmp/pybedtools.JPuSgj.tmp
RAM is not the issue here-- usage is very low and not near the 12gb limit on this machine. Removing your try/catch in call_bedtools gives:
OSError Traceback (most recent call last)
/home/wbiesing/<ipython console> in <module>()
/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
1575 distribution = self.randomintersection(other, iterations=iterations,
1576 **kwargs)
-> 1577 distribution = np.array(list(distribution))
1578
1579 # Median of distribution
/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug, report_iterations)
1660 sys.stderr.write('\r%s' % i)
1661 sys.stderr.flush()
-> 1662 tmp = self.shuffle(stream=True, **shuffle_kwargs)
1663 tmp2 = tmp.intersect(other, stream=True, **intersect_kwargs)
1664
/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
372 # this calls the actual method in the first place; *result* is
373 # whatever you get back
--> 374 result = method(self, *args, **kwargs)
375
376 # add appropriate tags
/home/wbiesing/src/pybedtools/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
174 # Do the actual call
175 process, stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176 check_stderr=check_stderr)
177 result = BedTool(stream)
178 result.process = process
/home/wbiesing/src/pybedtools/pybedtools/helpers.py in call_bedtools(cmds, tmpfn, stdin, check_stderr)
255 stdout=subprocess.PIPE,
256 stderr=subprocess.PIPE,
--> 257 bufsize=1)
258 output = p.stdout
259 stderr = None
/usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
670 p2cread, p2cwrite,
671 c2pread, c2pwrite,
--> 672 errread, errwrite)
673
674 if mswindows:
/usr/lib/python2.7/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
1113 gc.disable()
1114 try:
-> 1115 self.pid = os.fork()
1116 except:
1117 if gc_was_enabled:
OSError: [Errno 12] Cannot allocate memory
and suddenly the terminal that ran this process can't fork any more jobs
wbiesing@cross:~$ ls /tmp/ | wc -l
bash: fork: Cannot allocate memory
which makes it look like too many processes have been forked without cleanup...?
Also, it would be nice if the temporary files created by pybedtools were cleaned up--
wbiesing@cross:~$ ls /tmp/ | wc -l
43386
from pybedtools.
Hmm. Seems like more subprocess annoyances . . . this may take some time to debug. I'm worried pybedtools may be pushing subprocess in ways it wasn't designed for, at least for the randomintersection
use-case.
As for the cleanup, things should automatically get cleaned up if everything exits normally (thanks to atexit.register(cleanup)
in helpers.py
). Not sure how to gracefully handle post-crash cleanup though -- I've found it useful for debugging to have the tempfiles stick around after a crash. And I'm worried that deleting tempfiles before exit could result in data loss. Any ideas?
from pybedtools.
Still having trouble with the memory problem.
However, while debugging I found that in Python 2.6, if I use a processes
value > 1 when calling randomstats()
, the Popen.__del__
warning messages go away, providing an inadvertent fix. Sweet.
I'm not fluent with multiprocessing. I tried removing the if processes > 1
conditional, hoping to just run any call to randomstats()
through a Pool albeit with one worker. Doing so results in my new favorite error:
AssertionError: daemonic processes are not allowed to have children
(probably good universal advice in addition to an assertion error)
Do you know of a way of letting a Pool run with a single process? If so, this could fix the py2.6 problem so these changes could be merged into the master branch
edit: never mind about the merging . . . everything would have to be run through a Pool to get this benefit.
from pybedtools.
You'll get infinite recursion if you remove the conditional completely. The excellent error message is catching the Pool'ed process trying to create its own Pool (which would then create its own, etc). Better to do something like:
diff --git a/pybedtools/bedtool.py b/pybedtools/bedtool.py
index 33ec207..c2d9bf6 100644
--- a/pybedtools/bedtool.py
+++ b/pybedtools/bedtool.py
@@ -1648,7 +1648,7 @@ class BedTool(object):
[2, 2, 2, 0, 2, 3, 2, 1, 2, 3]
"""
- if processes > 1:
+ if processes is not None:
p = multiprocessing.Pool(processes)
iterations_each = [iterations / processes] * processes
iterations_each[-1] += iterations % processes
diff --git a/pybedtools/helpers.py b/pybedtools/helpers.py
index b0f8db2..e181ff9 100644
--- a/pybedtools/helpers.py
+++ b/pybedtools/helpers.py
@@ -377,4 +377,4 @@ def _call_randomintersect(_self, other, iterations, intersect_kwargs,
intersect_kwargs=intersect_kwargs,
shuffle_kwargs=shuffle_kwargs,
report_iterations=report_iterations,
- debug=False, processes=1))
+ debug=False, processes=None))
from pybedtools.
Ah, right. Thanks for the info.
Since the memory issue, and general subprocess
module woes, are proving problematic, I'm thinking of using a naive os.system
call inside randomintersect rather than relying on the pybedtools subprocess
mechanism -- essentially special-casing the code that will be run hundreds of thousands of times.
from pybedtools.
Short answer: I think all of these issues are fixed as of 1ddb49c. Can you please test?
Long answer and notes to self:
randomintersection()
calls __len__
, which calls count
, which calls __iter__
which dispatches to the correct iterator based on the kind of BedTool it is.
For file-based BedTools, the iterator is a Cython IntervalFile
. This appears to leave an open file hanging somewhere, even when the BedTool is deleted (circular references somewhere?). This still needs to be addressed, but there's a way around it: using a stream-based BedTool.
For stream-based BedTools, calling __iter__
returns a Cython IntervalIterator
. The underlying BedTool.fn
, though, is the stdout of the subprocess.Popen
call, rather than the filename passed to IntervalFile
for file-based BedTools. That means the fn
can be specifically closed, freeing up the open file. So the simple fix can be seen here: 1ddb49c#L0R1693.
However, adding logic to BedTool.__del__
, like:
def __del__(self):
if isinstance(self.fn, file):
self.fn.close()
does not fix the problem. My guess is that this is due to circular references that need to be tracked down and broken (i.e., search object.__del__
on http://docs.python.org/reference/datamodel.html).
So: the issue is fixed, but it will be nice to have __del__
do all the work so that users creating hundreds of thousands of BedTools outside of the randomintersection
method won't run into this issue.
from pybedtools.
Yeah the fix looks good to me, even on my larger files and 100's of thousands of iterations and several processes. Thanks for the update!
from pybedtools.
Related Issues (20)
- Support Python 3.10 and 3.11 HOT 1
- "python setup.py bdist_wheel did not run successfully" when pip installing with python v3.11 HOT 8
- to_dataframe() creates 0th row with generic names in nucleotide_content HOT 2
- build failure under python 3.11 HOT 6
- pybedtools intersect error HOT 2
- Cannot create a BedTool object from list of regions that uses np.int64 coordinates
- remove historical py27 support HOT 1
- bedtools intersect reported incorrect interval intersection HOT 3
- Cythonizing files requires `language_level=2` to be set in cythonize() HOT 4
- pybedtools multi_bam_coverage assistance HOT 2
- "fastaFromBed" error HOT 2
- intersect with multiple -b arguments not working with -sorted HOT 1
- Unable to install pybedtools==0.9.1 in Python3.10 HOT 4
- Len modifying the Bedtools after a filter HOT 2
- Has pybedtools considered packaging bedtools? HOT 3
- how to mask gap regions for randomization? HOT 1
- Issue while doing pip install pybedtools HOT 3
- Inconsistent behaviour when using files from `pathlib.PosixPath` with BedTool functions...
- pybedtools.bedtool.Bedtool.sort()
- total_coverage giving a incorrect value
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pybedtools.