Comments (15)
@mrocklin over the weekend I was running some SVD benchmarks again and came across a very similar issue that I think may be related to memory spilling. Could you confirm whether workers start to die when they run out of memory? That's exactly what was happening to me.
Sorry, I meant to say when the worker's GPU runs out of memory.
For the first kind of error, yes. I haven't plugged in the DeviceHostDisk spill mechanism yet.
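For readers following along: in later dask-cuda releases this mechanism is configured through the cluster's memory limits. A minimal sketch, assuming a dask-cuda version where LocalCUDACluster accepts device_memory_limit:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spill device buffers to host once a worker's GPU memory use passes
# device_memory_limit, and host buffers to disk past memory_limit.
cluster = LocalCUDACluster(
    device_memory_limit="4GB",  # device -> host spill threshold
    memory_limit="32GB",        # host -> disk spill threshold
)
client = Client(cluster)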
I was getting those errors even with DeviceHostDisk, unless I had some messed-up configuration I didn't notice. That said, there may be a bug and we need to test it better; I will do that soon.
FWIW I've also had some very similar pains with rechunking (particularly in cases where an array needs to be flattened out). I needed a ravel implementation that avoided rechunking entirely to bypass the issue. Would be happy to try out the current UCX work to see if this helps (or to point someone else playing with this to a good test case).
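A minimal sketch of that kind of workaround, assuming a C-ordered array chunked only along the first axis (so per-block ravels concatenate to the global ravel); ravel_no_rechunk is a hypothetical helper, not a dask API:

import dask.array as da
import numpy as np

def ravel_no_rechunk(x):
    # Hypothetical helper: flatten each block independently and
    # concatenate. Valid only when x is chunked along axis 0 alone,
    # so block order matches C-order element order and no rechunk
    # is ever triggered.
    assert x.numblocks[1:] == (1,) * (x.ndim - 1)
    blocks = [x.blocks[i].reshape(-1) for i in range(x.numblocks[0])]
    return da.concatenate(blocks)

x = da.random.random((8, 4), chunks=(2, 4))
assert np.array_equal(ravel_no_rechunk(x).compute(), x.compute().ravel())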
I may be wrong, but I think the issue here is not directly related to rechunking arrays, but rather to running out of device memory.
Yes, I run out of device memory immediately after starting a computation that follows rechunking. Happy to dive into it further with you if it is of interest.
Let me clean things up a bit and write down installation instructions. Then it'd be good to have people dive in. My thought was that @pentschev or @madsbk might be a better fit so that you don't get taken away from driving imaging applications.
I will definitely dive into that, since I have a strong feeling that the memory spilling mechanism may not be working properly, or not active at all. How urgent is this for both of you?
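One quick way to check whether spilling is active at all is to ask each worker what it uses as its data store; with device spilling wired up it should report a spilling store such as DeviceHostFile rather than a plain dict. A sketch, assuming a running cluster at a placeholder address:

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

# distributed injects the worker object into functions that take a
# `dask_worker` argument; report each worker's data-store type.
print(client.run(lambda dask_worker: type(dask_worker.data).__name__))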
This is very likely related to #57, in fact, probably the same bug on device memory spilling.
So I was checking this, and I can't reproduce any cuRAND errors. What I ultimately get instead is an out-of-memory error:
Traceback (most recent call last):
  File "dask-cuda-59.py", line 10, in <module>
    y.compute()
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/base.py", line 399, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/base.py", line 399, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/array/core.py", line 828, in finalize
    return concatenate3(results)
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/array/core.py", line 3607, in concatenate3
    return _concatenate2(arrays, axes=list(range(x.ndim)))
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/dask/array/core.py", line 228, in _concatenate2
    return concatenate(arrays, axis=axes[0])
  File "/home/nfs/pentschev/miniconda3/envs/rapids-0.7/lib/python3.7/site-packages/cupy/manipulation/join.py", line 49, in concatenate
    return core.concatenate_method(tup, axis)
  File "cupy/core/_routines_manipulation.pyx", line 563, in cupy.core._routines_manipulation.concatenate_method
  File "cupy/core/_routines_manipulation.pyx", line 608, in cupy.core._routines_manipulation.concatenate_method
  File "cupy/core/_routines_manipulation.pyx", line 637, in cupy.core._routines_manipulation._concatenate
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 518, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1106, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 934, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 949, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 697, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 12800000000 bytes (total 38400000000 bytes)
After some checking, I was able to confirm that dask-cuda has only ~5 GB in the device LRU; all the rest (over 30 GB) is temporary CuPy memory. I'm not sure what we can do to make such cases work, or whether we have an option at all. In this particular case, the amount of memory it tries to allocate is exactly the problem size, 40000 * 40000 * 8 = 12800000000 bytes, which CuPy may be trying to allocate for the final result. But unless we can control CuPy internally to spill device memory, I'm not sure we'll be able to support such problem sizes.
I will think a bit more about this; if you have any suggestions, please let me know.
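For reference, one rough way to measure the split described above is to compare dask-cuda's device LRU contents against what CuPy's default pool actually holds:

import cupy

pool = cupy.get_default_memory_pool()
# Bytes backing live CuPy arrays vs. bytes the pool has reserved from
# the driver; the portion not tracked in dask-cuda's device LRU is the
# "temporary" memory discussed above.
print(pool.used_bytes(), pool.total_bytes())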
To make sure I understand, the temporary CuPy memory here is likely from some sort of memory manager?
No, I also tried disabling it. The temporary memory could be any intermediary buffers that are needed, for example by concatenation of multiple arrays or any other function that can't write to its input memory (and thus requires additional memory to store the output).
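A small demonstration of such an intermediary buffer: concatenation can't be done in place, so CuPy must allocate a fresh output buffer of the combined size. A sketch using the default memory pool:

import cupy

pool = cupy.get_default_memory_pool()
a = cupy.ones((1000, 1000))  # ~8 MB
b = cupy.ones((1000, 1000))  # ~8 MB
before = pool.used_bytes()
c = cupy.concatenate([a, b])  # requires a new ~16 MB output buffer
print(pool.used_bytes() - before)  # roughly 16000000 bytes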
There has been great progress on that over the last year or so; I'm closing this, as I don't think it is an issue anymore.