Comments (7)
If you use nvidia-smi to hide a GPU, the device indexes change and the number of devices decreases. For example:
nvidia-smi drain -p <id> -m 0
The scheduler never deletes GPU device ids when the number of GPUs decreases:
volcano-sh/volcano#2215
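To make the stale-id problem concrete, here is a minimal sketch of the kind of reconciliation that seems to be missing. It is not Volcano's scheduler code: `syncGPUIDs`, `sortedIDs`, and the plain id set are hypothetical names invented for illustration, and it assumes the scheduler tracks GPUs by the contiguous ids 0..count-1.

```go
package main

import (
	"fmt"
	"sort"
)

// syncGPUIDs is a hypothetical helper (not part of Volcano). It makes the
// tracked id set equal to 0..count-1: ids at or above the reported count
// are deleted, and any missing ids below it are added.
func syncGPUIDs(tracked map[int]struct{}, count int) {
	for id := range tracked {
		if id >= count {
			delete(tracked, id) // drop ids the node no longer reports
		}
	}
	for id := 0; id < count; id++ {
		tracked[id] = struct{}{}
	}
}

// sortedIDs returns the tracked ids in ascending order for display.
func sortedIDs(tracked map[int]struct{}) []int {
	ids := make([]int, 0, len(tracked))
	for id := range tracked {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	tracked := map[int]struct{}{}
	syncGPUIDs(tracked, 4) // node first reports 4 GPUs
	syncGPUIDs(tracked, 3) // one GPU is drained, node now reports 3
	fmt.Println(sortedIDs(tracked))
}
```

Without the delete step, id 3 would stay in the set after the count drops to 3, which matches the behavior described in this comment.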
/bug
This bug has also been mentioned in the Volcano wechat group.
/cc @peiniliu @william-wang Can you help with that?
Can you share the reproduction steps and your environment? Let me give it a try.
If you use nvidia-smi to hide a GPU, the device indexes change and the number of devices decreases. For example:
nvidia-smi drain -p <id> -m 0
Or a GPU is damaged and the driver disables it.
When nvidia-smi is used to hide a GPU, for example /dev/nvidia0, the device plugin initializes its index map like this:
deviceByIndex := map[uint]string{}
for i := uint(0); i < n; i++ {
	d, err := nvml.NewDevice(i)
	check(err)
	var id uint
	_, err = fmt.Sscanf(d.Path, "/dev/nvidia%d", &id)
	check(err)
	deviceByIndex[id] = d.UUID
	// TODO: Do we assume all cards are of same capacity
}
So the device plugin has no GPU with index 0, but the scheduler starts its GPU indexes at 0, so allocation fails with:
Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected error
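The mismatch can be reproduced without a GPU. The sketch below is a simplified stand-in, not the plugin's actual code: the `device` struct, `buildDeviceByIndex`, and `lookup` are hypothetical names, and the UUIDs are made up. It keys the map by the number parsed from the device path, as in the loop above, and then looks devices up by the 0-based indexes a scheduler would presumably assign when it only knows the GPU count.

```go
package main

import (
	"fmt"
)

// device is a hypothetical stand-in for the two NVML device fields used above.
type device struct {
	Path string
	UUID string
}

// buildDeviceByIndex keys each UUID by the number parsed from its device path,
// following the same approach as the plugin loop quoted in this comment.
func buildDeviceByIndex(devs []device) (map[uint]string, error) {
	byIndex := make(map[uint]string, len(devs))
	for _, d := range devs {
		var id uint
		if _, err := fmt.Sscanf(d.Path, "/dev/nvidia%d", &id); err != nil {
			return nil, fmt.Errorf("parse %q: %w", d.Path, err)
		}
		byIndex[id] = d.UUID
	}
	return byIndex, nil
}

// lookup resolves an index chosen by the scheduler to a device UUID.
func lookup(byIndex map[uint]string, idx uint) (string, error) {
	uuid, ok := byIndex[idx]
	if !ok {
		return "", fmt.Errorf("failed to find gpu id %d", idx)
	}
	return uuid, nil
}

func main() {
	// Assumption: /dev/nvidia0 has been hidden, so only two devices are visible.
	devs := []device{
		{Path: "/dev/nvidia1", UUID: "GPU-bbbb"},
		{Path: "/dev/nvidia2", UUID: "GPU-cccc"},
	}
	byIndex, err := buildDeviceByIndex(devs)
	if err != nil {
		panic(err)
	}
	// A scheduler that sees 2 GPUs and numbers them from 0 asks for 0 and 1.
	for idx := uint(0); idx < uint(len(devs)); idx++ {
		uuid, err := lookup(byIndex, idx)
		fmt.Printf("index %d: uuid=%q err=%v\n", idx, uuid, err)
	}
}
```

Index 0 is missing from the map, so the first lookup fails, while index 2 exists in the map but is never requested.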
Same issue: volcano-sh/volcano#2701. I am using Volcano 1.7 and https://github.com/volcano-sh/devices/blob/release-1.0/volcano-device-plugin.yml
I did not use
nvidia-smi drain -p <id> -m 0