Comments (14)
I agree that this would be great to have. It would also be interesting to think about what other dashboards we might provide for users about their GPUs. It would also be interesting to think about dashboards that might be useful outside of the context of Dask. I know that some folks have been interested in this generally. I'd be happy to help anyone that wanted to push on this effort longer term.
Long term, we might want to use a library like pynvml to do this with a little less overhead.
Those are both long-term comments. I have no strong objection to using nvidia-smi in the short term, provided that it's not too expensive to run (I suspect that it blocks the worker every time we poll for resources).
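For reference, here's a minimal sketch of what the pynvml route might look like, assuming the pynvml package is installed and querying only the first GPU:

```python
# Minimal sketch: query GPU memory and utilization via NVML instead of
# shelling out to nvidia-smi. Assumes the pynvml package is installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this machine

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .total/.used/.free in bytes
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu/.memory in percent

print(f"memory: {mem.used / 2**30:.2f} / {mem.total / 2**30:.2f} GiB")
print(f"utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```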
One other thing that might be interesting when thinking about memory specifically is leveraging Python's tracemalloc. This would generally give us a way of tracking memory allocations, and that input could be fed into other things like dashboards or even used by scripts, libraries, or other user applications. To use this we would need to register those allocations ourselves. It could be something that RMM would do at the Python interface level. Alternatively, several different libraries could do this and we could filter out their individual contributions to memory usage.
That said, pynvml is useful for more than just memory allocations, so it would certainly provide a variety of interesting diagnostic feedback.
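For illustration, a small sketch of the tracemalloc side of this (host memory only; GPU allocators such as RMM would have to report their allocations separately for them to appear here):

```python
# Sketch: track Python-level allocations with tracemalloc. This only sees
# memory allocated through Python's allocator; GPU allocations would need
# to be registered separately (e.g. by the allocating library itself).
import tracemalloc

tracemalloc.start()

buffers = [bytearray(1024 * 1024) for _ in range(10)]  # allocate ~10 MiB

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")

# Top allocation sites, grouped by source line
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```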
As dask_cudf usage has grown, one-off scripts for monitoring GPU memory have started to proliferate.
Is there someone who can work on a first version of built-in GPU memory monitoring?
It's a start. When debugging workflows, there's often a lengthy pause while someone driving a notebook jumps back to (multiple) terminals to check GPU memory usage via nvidia-smi.
Having the dashboard show total GPU memory used by persisted DataFrames would be a great first step in alleviating that pain.
I would think a future improvement would be surfacing peak memory used, since DataFrame operations often cause significant, albeit temporary, spikes in GPU memory usage.
Beyond that, having access to this kind of data suggests we might be able to determine whether a given workflow could safely benefit from using multiple threads or processes per GPU worker.
If you're on a single GPU then you probably want the solution in https://github.com/rjzamora/jupyterlab-bokeh-server/tree/pynvml
This has been done for a while, we just needed to package it up (which requires people comfortable with both Python and JS, which is somewhat rare).
@jacobtomlinson seemed interested in doing this. Jacob, if this is easy for you and not too much of a distraction could you prioritize it?
Regardless, I imagine I'll have the Dask version done in a week or two. It could be done sooner if this is a high priority for you, Randy. I get the sense that it's only a mild annoyance for now and not a burning fire, but I could be wrong.
You read me well =)
I'm glad to know that there's been significant progress in the meantime.
There is an initial pair of plots in dask/distributed#2944. They won't be very discoverable until dask/dask-labextension#75, but you can navigate to them directly by appending /individual-gpu-memory and /individual-gpu-utilization to the scheduler's dashboard address, often :8787.
We might ask someone like @rjzamora to expand on that work, but I still think that if you're on a single node, you're probably better off with his existing project.
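As a rough illustration of reaching those routes programmatically (the cluster setup below is only an example, and the exact dashboard link format can differ by configuration):

```python
# Sketch: construct the individual GPU plot URLs from a running cluster's
# dashboard link. LocalCUDACluster is used here only as an example.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

# dashboard_link usually looks like http://127.0.0.1:8787/status
base = client.dashboard_link.rsplit("/status", 1)[0]
print(base + "/individual-gpu-memory")
print(base + "/individual-gpu-utilization")
```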
I think https://github.com/rapidsai/jupyterlab-nvdashboard addresses most (if not all) of the requests. Regardless, I think that's now a more appropriate project for future feature requests than dask-cuda, so I'm closing this, but feel free to reopen should something in dask-cuda still be necessary.
Hi All,
It looks like this functionality has been out for a while. One of our users needs to get GPU metrics from dask-cuda-workers. In this setup the scheduler is running on one node and the dask-cuda-workers on different nodes.
The question here is which GPU(s) /individual-gpu-memory and /individual-gpu-utilization refer to: the GPU(s) available in a local CUDA cluster?
On the other hand, I also tried the dask-labextension, but it only shows GPU metrics from the machine running the scheduler.
These are the Dask-related packages in use:
$ conda list | grep dask
dask 2.19.0 py_0 conda-forge
dask-core 2.19.0 py_0
dask-cuda 0.14.1 pypi_0 pypi
dask-cudf 0.14.0a0+5439.g6244cfc.dirty pypi_0 pypi
dask-labextension 2.0.2 py_0 conda-forge
dask_labextension 2.0.2 0 conda-forge
and the Jupyter packages:
jupyter-server-proxy 1.5.0 py_0 conda-forge
jupyter_client 6.1.6 py_0
jupyter_core 4.6.1 py37_0
jupyterlab 2.1.5 py_0 conda-forge
jupyterlab-nvdashboard 0.3.1 py_0 conda-forge
jupyterlab_server 1.2.0 py_0
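In the meantime, one way to pull GPU metrics from each dask-cuda-worker directly is Client.run with pynvml. A rough sketch, assuming pynvml is installed on every worker node and that CUDA_VISIBLE_DEVICES contains device indices; the gpu_memory_used helper and scheduler address below are placeholders, not part of dask-cuda:

```python
# Sketch: poll GPU memory on every dask-cuda worker via Client.run.
# Assumes pynvml is installed on each worker node.
from distributed import Client

def gpu_memory_used():
    # Runs on each worker; imports kept local so the function is self-contained.
    import os
    import pynvml

    pynvml.nvmlInit()
    # NVML ignores CUDA_VISIBLE_DEVICES, so pick the worker's assigned device
    # index from the environment (dask-cuda sets it per worker).
    dev = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
    mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(dev))
    return {"used": mem.used, "total": mem.total}

client = Client("tcp://scheduler-node:8786")  # placeholder scheduler address
for worker, mem in client.run(gpu_memory_used).items():
    print(worker, f"{mem['used'] / 2**30:.2f} / {mem['total'] / 2**30:.2f} GiB")
```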
I'm not sure about any of those questions. @quasiben @jakirkham are you familiar with those?
Could we please move this over to a new issue?
Edit: I'd also recommend including a screenshot if you can. That should make it a bit clearer what's going on.