volcano-sh / devices
Device plugins for Volcano, e.g. GPU
License: Apache License 2.0
Deploying different types of GPUs on the same node is very common in production environments. However, the device plugin does not support this so far. We are going to support it and make it work with Volcano capacity scheduling, so that users are able to configure different quotas for different types of GPU.
Currently, we do not support Prow for this repo, which makes it inconvenient for contributors :)
For k8s v1.17.9, an UnexpectedAdmissionError occurs due to the lack of the pod "update" verb:
Warning UnexpectedAdmissionError 10s kubelet, amax-pcl Update plugin resources failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected.
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
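The admission failure above suggests the plugin's service account lacks the update verb on pods. A likely fix, assuming this rule lives in the volcano-device-plugin ClusterRole, is to add the missing verb:

```yaml
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "update"]
```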
Adding support for MIG devices. The pull request is still WIP; it is only meant for reviewing the general idea.
#20
As we're going to support different devices, we prefer to isolate the code by directory and share common utilities in a package, e.g. pkg/common.
I found the example is missing the volcano.sh/gpu-index annotation. If the volcano.sh/gpu-index annotation is not specified when allocating volcano.sh/gpu-memory, the pod will always fail to be created.
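An illustrative manifest combining the two (the pod name and index value here are hypothetical; the volcano.sh/gpu-index annotation is normally filled in by the scheduler):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-pod          # hypothetical name
  annotations:
    volcano.sh/gpu-index: "0"   # illustrative value
spec:
  schedulerName: volcano
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    resources:
      limits:
        volcano.sh/gpu-memory: 1024
```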
The Kubernetes kubelet service on the host is encountering intermittent failures in communicating with the device plugin, specifically regarding the volcano.sh/vgpu-number resource.
Allocatable:
volcano.sh/gpu-memory: 0
volcano.sh/gpu-number: 2
volcano.sh/vgpu-number: 8
Mar 19 14:31:09 dell-63 kubelet[1067]: E0319 14:31:09.987111 1067 endpoint.go:107] "listAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resourceName="volcano.sh/vgpu-number"
Mar 19 14:31:09 dell-63 kubelet[1067]: W0319 14:31:09.987121 1067 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/nvidia-gpu.sock /var/lib/kubelet/device-plugins/nvidia-gpu.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/nvidia-gpu.sock: connect: connection refused". Reconnecting...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12s default-scheduler Successfully assigned crater-jobs/gpu-pod11111 to dell-63
Warning UnexpectedAdmissionError 12s kubelet Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
I0319 14:51:12.329385 1 main.go:77] Loading NVML
I0319 14:51:12.356024 1 main.go:91] Starting FS watcher.
I0319 14:51:12.356095 1 main.go:98] Starting OS watcher.
I0319 14:51:12.369951 1 main.go:116] Retreiving plugins.
I0319 14:51:12.370235 1 register.go:101] into WatchAndRegister
2024/03/19 14:51:12 Starting GRPC server for 'volcano.sh/vgpu-number'
2024/03/19 14:51:12 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/03/19 14:51:12 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0319 14:51:12.397816 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:12.397800845 +0000 UTC m=+0.077900234
I0319 14:51:42.431339 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:42.43132017 +0000 UTC m=+30.111419558
I0319 14:52:12.577834 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:52:12.577817165 +0000 UTC m=+60.257916570
The socket appears not to start successfully, yet there are no error messages?
Is it possible to schedule a job onto a specific GPU model by adding nvidia.com/a100 or nvidia.com/gpu-type: a100 to the job's spec?
Running the pod reports an error: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
2024-03-07T10:55:34.101084725Z stderr F I0307 10:55:34.100796 1 register.go:88] Reporting devices GPU-1c5009f6-3637-cd7d-4955-5eaa038e563e,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-809c5c58-ad8f-c998-2447-4ac9befe0fdb,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-e1fde3ec-1842-57b1-5862-3585e22923d1,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-cd8f49bf-1607-360b-8bf3-3defa6a58bb0,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false: in 2024-03-07 10:55:34.100782062 +0000 UTC m=+600.800399211
2024-03-07T10:55:47.977734126Z stderr F I0307 10:55:47.977559 1 plugin.go:296] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-e1fde3ec-1842-57b1-5862-3585e22923d1-3 GPU-809c5c58-ad8f-c998-2447-4ac9befe0fdb-2],}]
2024-03-07T10:55:48.374277561Z stderr F panic: runtime error: invalid memory address or nil pointer dereference
2024-03-07T10:55:48.374309502Z stderr F [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]
2024-03-07T10:55:48.374313183Z stderr F
2024-03-07T10:55:48.374316238Z stderr F goroutine 234 [running]:
2024-03-07T10:55:48.374319213Z stderr F volcano.sh/k8s-device-plugin/pkg/plugin/vgpu.(*NvidiaDevicePlugin).Allocate(0xc0001bc280, {0x14cff80, 0xc00045c720}, 0xc0004427c0)
2024-03-07T10:55:48.374321927Z stderr F /go/src/volcano.sh/devices/pkg/plugin/vgpu/plugin.go:313 +0x353
2024-03-07T10:55:48.374324767Z stderr F k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x1292040?, 0xc0001bc280}, {0x14cff80, 0xc00045c720}, 0xc000516960, 0x0)
2024-03-07T10:55:48.374327831Z stderr F /go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
2024-03-07T10:55:48.374330155Z stderr F google.golang.org/grpc.(*Server).processUnaryRPC(0xc000183040, {0x14d4e98, 0xc000322180}, 0xc000148c00, 0xc0001a6180, 0x1ce57f8, 0x0)
2024-03-07T10:55:48.374349455Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
2024-03-07T10:55:48.374354779Z stderr F google.golang.org/grpc.(*Server).handleStream(0xc000183040, {0x14d4e98, 0xc000322180}, 0xc000148c00, 0x0)
2024-03-07T10:55:48.374357667Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
2024-03-07T10:55:48.374360438Z stderr F google.golang.org/grpc.(*Server).serveStreams.func1.1()
2024-03-07T10:55:48.374373665Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
2024-03-07T10:55:48.374376854Z stderr F created by google.golang.org/grpc.(*Server).serveStreams.func1
2024-03-07T10:55:48.374379632Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea
Facing the above error when trying to create a pod with vGPU enabled.
Request from #11
Refer to the GPU share user doc (https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md).
Currently, only specifying GPU shared memory is supported; not being able to specify the number of GPUs is a limitation.
We need to support volcano.sh/gpu-number, just like the NVIDIA GPU plugin, so that a pod's resource request can specify the number of GPUs.
Is it possible to run the volcano device plugin on containerd? If so, how?
We got the message:
Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to nvidia?
This issue is an extension of #18
What happened:
Applying volcano-device-plugin on a server with 8*V100 GPUs, we get volcano.sh/gpu-memory: 0 when describing nodes.
The same situation did not occur when using T4 or P4.
Tracing the kubelet logs, we found the following error message; it seems the sync message is too large.
What caused this bug:
volcano-device-plugin mocks GPUs into a device list (every device in this list is treated as a 1MB memory block) so that different workloads can share one GPU through the Kubernetes device plugin mechanism. When a large-memory GPU such as a V100 is used, the size of the device list exceeds the message bound, and ListAndWatch fails as a result.
Solutions:
The key is to shrink the device list, so we can treat each device as a 10MB memory block and reform the whole bookkeeping process according to this assumption. This granularity is enough for almost all production environments.
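The effect on the list size can be sketched as follows; `mockDeviceCount` and `blockMiB` are illustrative names, not the plugin's actual API:

```go
package main

import "fmt"

// mockDeviceCount returns how many placeholder devices the plugin would
// register for one GPU, given its memory in MiB and the size of each
// bookkeeping block in MiB.
func mockDeviceCount(gpuMemoryMiB, blockMiB uint) uint {
	return gpuMemoryMiB / blockMiB
}

func main() {
	// A 32768 MiB V100 with 1 MiB blocks produces a huge device list;
	// 10 MiB blocks shrink it by an order of magnitude.
	fmt.Println(mockDeviceCount(32768, 1))  // 32768 devices
	fmt.Println(mockDeviceCount(32768, 10)) // 3276 devices
}
```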
When volcano.sh/vgpu-memory is "12288" and volcano.sh/vgpu-number is "2", will the requested 12288MiB of GPU memory be evenly allocated across the two cards?
Running the command below schedules 2 pods at the same time, one allocated 24576M of GPU memory and one 600M. After the pods came up, I entered the containers and ran nvidia-smi, and found that the two amounts are swapped: container ubuntu-container-24576 was given 600M of GPU memory, and container ubuntu-container-600 was given 24576M.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod-1v-24576-1
spec:
schedulerName: volcano
containers:
- name: ubuntu-container-24576
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
volcano.sh/vgpu-number: 1 # requesting 1 vGPUs
volcano.sh/vgpu-memory: 24576
---
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod-1v-600-1
spec:
schedulerName: volcano
containers:
- name: ubuntu-container-600
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
volcano.sh/vgpu-number: 1 # requesting 1 vGPUs
volcano.sh/vgpu-memory: 600
EOF
This may be a bug.
The yaml is:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: mindx-dls-gpu
namespace: vcjob
spec:
minAvailable: 1
schedulerName: volcano
policies:
- event: PodEvicted
action: RestartJob
maxRetry: 3
queue: default
tasks:
- name: "default-1p"
replicas: 1
template:
metadata:
labels:
app: tf
spec:
containers:
- image: nvidia-train:v1
imagePullPolicy: IfNotPresent
name: cuda-container
command:
- "/bin/bash"
- "-c"
#- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
args: [ "while true; do sleep 3000000; done;" ]
resources:
requests:
volcano.sh/gpu-number: 1
limits:
volcano.sh/gpu-number: 1
volumeMounts:
- name: timezone
mountPath: /etc/timezone
- name: localtime
mountPath: /etc/localtime
nodeSelector:
accelerator: nvidia-tesla-v100
volumes:
- name: timezone
hostPath:
path: /etc/timezone
- name: localtime
hostPath:
path: /etc/localtime
restartPolicy: OnFailure
And when I use the "volcano.sh/gpu-memory" resource, there is an error:
Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895 26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941 26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.
env:
volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04 amd64
So, is volcano.sh/gpu-memory supported?
For now, "volcano.sh/gpu-memory" can essentially only set 3 environment variables (based on https://github.com/volcano-sh/devices/blob/master/pkg/plugin/nvidia/server.go#L324) to tell the user how much GPU memory they can use in the current container?
For example, when "volcano.sh/gpu-memory: 1024" is set, the environment variables inside the container look like those below; in fact, a process in this container can use up to 4096M of GPU memory, even though VOLCANO_GPU_ALLOCATED is 1024?
NVIDIA_VISIBLE_DEVICES: "0"
VOLCANO_GPU_ALLOCATED: "1024"
VOLCANO_GPU_TOTAL: "4096"
Can anyone help me confirm whether I understand this correctly?
The tag in the containers field:
image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04
can't currently be found on Docker Hub. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?
Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"
as in issue #1181
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Currently, only one GPU device can be allocated, but in many scenarios we want two or more devices on the same server, or a few devices across servers, to be allocated to a pod.
Thank u!
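Once supported, a pod could request multiple devices the same way the NVIDIA plugin allows (illustrative manifest; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod   # hypothetical name
spec:
  schedulerName: volcano
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    resources:
      limits:
        volcano.sh/gpu-number: 2   # two full GPUs on the same node
```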
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: ocr-job
spec:
minAvailable: 1
schedulerName: volcano
queue: default
policies:
- event: PodEvicted
action: RestartJob
tasks:
- replicas: 1
name: ocr
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- image: ai-grpc-ocr:v1.4
name: ocr
resources:
requests:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
limits:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
restartPolicy: Never
- replicas: 1
name: ocr-2
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- image: ai-grpc-ocr:v1.4
name: ocr
resources:
requests:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
limits:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
restartPolicy: Never
log
$ k get no
NAME STATUS ROLES AGE VERSION
10.122.2.14 Ready <none> 42d v1.26.1
10.122.2.26 Ready <none> 154m v1.26.1
10.122.2.37 Ready <none> 44m v1.26.1
$ k get po
NAME READY STATUS RESTARTS AGE
ocr-job-ocr-0 0/1 Pending 0 4m33s
ocr-job-ocr-2-0 0/1 Pending 0 4m33s
volcano-admission-7f76fc8cf4-rcp85 1/1 Running 0 35d
volcano-admission-init-785w6 0/1 Completed 0 35d
volcano-controllers-6875c95bd7-zs49k 1/1 Running 0 35d
volcano-scheduler-6dcf84d54d-gcwxm 0/1 CrashLoopBackOff 9 (80s ago) 58m
$ k get po
NAME READY STATUS RESTARTS AGE
ocr-job-ocr-0 0/1 Pending 0 41s
ocr-job-ocr-2-0 0/1 Pending 0 41s
volcano-admission-7f76fc8cf4-rcp85 1/1 Running 0 35d
volcano-admission-init-785w6 0/1 Completed 0 35d
volcano-controllers-6875c95bd7-zs49k 1/1 Running 0 35d
volcano-scheduler-6dcf84d54d-zg4d2 0/1 CrashLoopBackOff 2 (25s ago) 4m20s
I0816 12:30:34.061174 1 allocate.go:180] There are <3> nodes for Job <volcano-system/ocr-job-11507a57-1b68-46ad-83bf-38e0c2d76f99>
I0816 12:30:34.061251 1 predicate_helper.go:74] Predicates failed for task <volcano-system/ocr-job-ocr-0> on node <10.122.2.14>: task volcano-system/ocr-job-ocr-0 on node 10.122.2.14 fit failed: Insufficient volcano.sh/gpu-number
E0816 12:30:34.061385 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1caa960?, 0x32ac650})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1caa960, 0x32ac650})
/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
I0816 12:30:34.061465 1 statement.go:352] Discarding operations ...
I0816 12:30:34.061494 1 allocate.go:135] Try to allocate resource to Jobs in Queue <default>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x15ab261]
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1caa960, 0x32ac650})
/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
There are 3 nodes: 10.122.2.26 / 10.122.2.37 are GPU machines; 10.122.2.14 is a CPU machine.
Switching to the nvidia.com/gpu resource, scheduling fails directly. The cause is currently unknown.
It was deployed in July, using the latest image.
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
Currently, in the device plugin's Allocate RPC, we need to find the candidate pod according to the container in the request.
If there are multiple GPU containers in one pod, there will obviously be logic problems when finding the candidate pod.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	var reqCount uint
	for _, req := range reqs.ContainerRequests {
		reqCount += uint(len(req.DevicesIDs))
	}

	responses := pluginapi.AllocateResponse{}
	firstContainerReq := reqs.ContainerRequests[0]
	firstContainerReqDeviceCount := uint(len(firstContainerReq.DevicesIDs))

	availablePods := podSlice{}
	pendingPods, err := m.kubeInteractor.GetPendingPodsOnNode()
	if err != nil {
		return nil, err
	}
	for _, pod := range pendingPods {
		current := pod
		if IsGPURequiredPod(&current) && !IsGPUAssignedPod(&current) && !IsShouldDeletePod(&current) {
			availablePods = append(availablePods, &current)
		}
	}

	sort.Sort(availablePods)

	var candidatePod *v1.Pod
	for _, pod := range availablePods {
		for i, c := range pod.Spec.Containers {
			if !IsGPURequiredContainer(&c) {
				continue
			}
			if GetGPUResourceOfContainer(&pod.Spec.Containers[i]) == firstContainerReqDeviceCount {
				klog.Infof("Got candidate Pod %s(%s), the device count is: %d", pod.UID, c.Name, firstContainerReqDeviceCount)
				candidatePod = pod
				goto Allocate
			}
		}
	}
....
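One way to make the candidate search robust to pods with several GPU containers is to match the full multiset of per-container device counts instead of only the first request. A simplified sketch; `matchesPod` and its plain `[]int` inputs are stand-ins for the real `ContainerRequests`/pod types:

```go
package main

import (
	"fmt"
	"sort"
)

// matchesPod reports whether the per-container device counts of one
// Allocate call (requestCounts) correspond, in any order, to the GPU
// counts of a pending pod's GPU containers (podCounts).
func matchesPod(requestCounts, podCounts []int) bool {
	if len(requestCounts) != len(podCounts) {
		return false
	}
	a := append([]int(nil), requestCounts...)
	b := append([]int(nil), podCounts...)
	sort.Ints(a)
	sort.Ints(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// A pod whose two GPU containers ask for 2 and 1 devices matches the
	// Allocate call [2 1] regardless of container order, but a pod asking
	// for [1 1] does not.
	fmt.Println(matchesPod([]int{2, 1}, []int{1, 2})) // true
	fmt.Println(matchesPod([]int{2, 1}, []int{1, 1})) // false
}
```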
When a pod with the volcano resource is run, it crashes
https://github.com/volcano-sh/devices/blob/master/volcano-vgpu-device-plugin.yml
2023/08/14 12:05:25 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:25 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
(these four lines repeat continuously)
When I attempted to install the plugin, the pod gave the following error:
flag provided but not defined: -gpu-strategy
Usage of volcano-device-plugin:
I tried modifying the args in volcano-device-plugin.yaml from ["--gpu-strategy=share", "--gpu-memory-factor=1"] to ["---gpu-strategy=share", "---gpu-memory-factor=1"]. However, the error message then became:
flag provided but not defined: ----gpu-strategy
Usage of volcano-device-plugin:
In the end, I had to remove the args to successfully create the pod.
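For what it's worth, Go's flag package treats one leading dash and two leading dashes identically, so adding more dashes cannot be the fix; a third dash is simply malformed syntax. The original "flag provided but not defined" error more likely means the binary in the image does not register -gpu-strategy at all (e.g. an older plugin image). A minimal sketch of the dash behavior; `parseStrategy` is a hypothetical helper, not the plugin's code:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseStrategy parses a -gpu-strategy flag from args, returning the
// chosen value or a parse error.
func parseStrategy(args []string) (string, error) {
	fs := flag.NewFlagSet("volcano-device-plugin", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the demo's usage text quiet
	strategy := fs.String("gpu-strategy", "number", "share or number")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *strategy, nil
}

func main() {
	// One dash and two dashes are interchangeable in Go's flag package.
	fmt.Println(parseStrategy([]string{"--gpu-strategy=share"}))
	fmt.Println(parseStrategy([]string{"-gpu-strategy=share"}))
	// A third dash is rejected as bad flag syntax, not as an unknown flag.
	fmt.Println(parseStrategy([]string{"---gpu-strategy=share"}))
}
```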
I0615 08:06:10.942719 1 plugin.go:382] Allocate Response [&ContainerAllocateResponse{Envs:map[string]string{CUDA_DEVICE_MEMORY_LIMIT_0: 1024m,CUDA_DEVICE_MEMORY_SHARED_CACHE: /tmp/vgpu/6b5a834e-6fec-47d6-b629-0468cd18ba69.cache,NVIDIA_VISIBLE_DEVICES: GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4,},Mounts:[]*Mount{&Mount{ContainerPath:/usr/local/vgpu/libvgpu.so,HostPath:/usr/local/vgpu/libvgpu.so,ReadOnly:true,},&Mount{ContainerPath:/etc/ld.so.preload,HostPath:/usr/local/vgpu/ld.so.preload,ReadOnly:true,},&Mount{ContainerPath:/tmp/vgpu,HostPath:/tmp/vgpu/containers/1a7defed-9fff-4feb-8921-45cc7ea253f7_vgpu2,ReadOnly:false,},&Mount{ContainerPath:/tmp/vgpulock,HostPath:/tmp/vgpulock,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},}]
I0615 08:06:11.002096 1 util.go:229] TrySuccess:
I0615 08:06:11.002123 1 util.go:235] AllDevicesAllocateSuccess releasing lock
I0615 08:06:11.338349 1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4-5],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]
goroutine 68 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc00038ec80, {0x14cfee0, 0xc0003ee510}, 0xc0005eca00)
/go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc00038ec80}, {0x14cfee0, 0xc0003ee510}, 0xc00062e060, 0x0)
/go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0xc000367aa0, 0x1ce57f8, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea
This may be a bug.
When I request 1 GPU from k8s (the yaml in #10), I find that volcano-deviceplugin gives all of the node's GPUs to my pod.
The picture is as follows (in /dev/):
I don't know why.
k8s:1.17.3
volcano:v1.0.1
volcano-deviceplugin:1.0.0
docker:18.06.3-ce
os : ubuntu18.04
arch: x86
NVIDIA Tesla V100 * 8
cuda10.1 driver
volcano-device-plugin
It can get volcano.sh/gpu-number: 8,
but it cannot get the GPU memory: volcano.sh/gpu-memory is 0.
What is the reason for this?
What version of cuda does vgpu support?