volcano-sh / devices
Device plugins for Volcano, e.g. GPU
License: Apache License 2.0
Deploying different types of GPUs on the same node is very common in production environments. However, the device plugin does not support this so far. We are going to support it and make it work with Volcano capacity scheduling, so that users are able to configure different quotas for different types of GPU.
Currently, we do not support Prow for this repo, which makes it inconvenient for contributors :)
For k8s v1.17.9, an UnexpectedAdmissionError occurs due to the lack of the pod "update" verb:
Warning UnexpectedAdmissionError 10s kubelet, amax-pcl Update plugin resources failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected.
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
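The admission failure above suggests the plugin's service account lacks the update verb on pods. A likely fix, assuming this rule lives in the volcano-device-plugin ClusterRole, is to add the missing verb:

```yaml
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "update"]
```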
Adding support for MIG devices. The pull request is still WIP; it is only meant for reviewing the general idea.
#20
As we're going to support different devices, we prefer to isolate the code by directory and share common utilities in a package, e.g. pkg/common.
I found the example is missing the volcano.sh/gpu-index annotation. If the volcano.sh/gpu-index annotation is not specified when allocating volcano.sh/gpu-memory, the pod will always fail to be created.
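An illustrative manifest combining the two (the pod name and index value here are hypothetical; the volcano.sh/gpu-index annotation is normally filled in by the scheduler):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-memory-pod          # hypothetical name
  annotations:
    volcano.sh/gpu-index: "0"   # illustrative value
spec:
  schedulerName: volcano
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    resources:
      limits:
        volcano.sh/gpu-memory: 1024
```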
The Kubernetes kubelet service on the host is encountering intermittent failures in communicating with the device plugin, specifically regarding the volcano.sh/vgpu-number resource.
Allocatable:
volcano.sh/gpu-memory: 0
volcano.sh/gpu-number: 2
volcano.sh/vgpu-number: 8
Mar 19 14:31:09 dell-63 kubelet[1067]: E0319 14:31:09.987111 1067 endpoint.go:107] "listAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resourceName="volcano.sh/vgpu-number"
Mar 19 14:31:09 dell-63 kubelet[1067]: W0319 14:31:09.987121 1067 clientconn.go:1326] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/nvidia-gpu.sock /var/lib/kubelet/device-plugins/nvidia-gpu.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/nvidia-gpu.sock: connect: connection refused". Reconnecting...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12s default-scheduler Successfully assigned crater-jobs/gpu-pod11111 to dell-63
Warning UnexpectedAdmissionError 12s kubelet Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
I0319 14:51:12.329385 1 main.go:77] Loading NVML
I0319 14:51:12.356024 1 main.go:91] Starting FS watcher.
I0319 14:51:12.356095 1 main.go:98] Starting OS watcher.
I0319 14:51:12.369951 1 main.go:116] Retreiving plugins.
I0319 14:51:12.370235 1 register.go:101] into WatchAndRegister
2024/03/19 14:51:12 Starting GRPC server for 'volcano.sh/vgpu-number'
2024/03/19 14:51:12 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024/03/19 14:51:12 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0319 14:51:12.397816 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:12.397800845 +0000 UTC m=+0.077900234
I0319 14:51:42.431339 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:51:42.43132017 +0000 UTC m=+30.111419558
I0319 14:52:12.577834 1 register.go:89] Reporting devices GPU-01020ff8-5d55-31c9-9d5d-e433bef1579d,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false:GPU-50dbcaf2-0289-c862-c69c-5c0a967021a1,4,11264,NVIDIA-NVIDIA GeForce RTX 2080 Ti,false: in 2024-03-19 14:52:12.577817165 +0000 UTC m=+60.257916570
The socket appears not to start successfully, yet there are no error messages?
Is it possible to schedule a job onto a specific GPU model by adding nvidia.com/a100 or nvidia.com/gpu-type: a100 to the job's spec?
Running the pod reports an error: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected
2024-03-07T10:55:34.101084725Z stderr F I0307 10:55:34.100796 1 register.go:88] Reporting devices GPU-1c5009f6-3637-cd7d-4955-5eaa038e563e,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-809c5c58-ad8f-c998-2447-4ac9befe0fdb,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-e1fde3ec-1842-57b1-5862-3585e22923d1,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false:GPU-cd8f49bf-1607-360b-8bf3-3defa6a58bb0,10,32768,NVIDIA-Tesla V100-SXM2-32GB,false: in 2024-03-07 10:55:34.100782062 +0000 UTC m=+600.800399211
2024-03-07T10:55:47.977734126Z stderr F I0307 10:55:47.977559 1 plugin.go:296] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-e1fde3ec-1842-57b1-5862-3585e22923d1-3 GPU-809c5c58-ad8f-c998-2447-4ac9befe0fdb-2],}]
2024-03-07T10:55:48.374277561Z stderr F panic: runtime error: invalid memory address or nil pointer dereference
2024-03-07T10:55:48.374309502Z stderr F [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]
2024-03-07T10:55:48.374313183Z stderr F
2024-03-07T10:55:48.374316238Z stderr F goroutine 234 [running]:
2024-03-07T10:55:48.374319213Z stderr F volcano.sh/k8s-device-plugin/pkg/plugin/vgpu.(*NvidiaDevicePlugin).Allocate(0xc0001bc280, {0x14cff80, 0xc00045c720}, 0xc0004427c0)
2024-03-07T10:55:48.374321927Z stderr F /go/src/volcano.sh/devices/pkg/plugin/vgpu/plugin.go:313 +0x353
2024-03-07T10:55:48.374324767Z stderr F k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x1292040?, 0xc0001bc280}, {0x14cff80, 0xc00045c720}, 0xc000516960, 0x0)
2024-03-07T10:55:48.374327831Z stderr F /go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
2024-03-07T10:55:48.374330155Z stderr F google.golang.org/grpc.(*Server).processUnaryRPC(0xc000183040, {0x14d4e98, 0xc000322180}, 0xc000148c00, 0xc0001a6180, 0x1ce57f8, 0x0)
2024-03-07T10:55:48.374349455Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
2024-03-07T10:55:48.374354779Z stderr F google.golang.org/grpc.(*Server).handleStream(0xc000183040, {0x14d4e98, 0xc000322180}, 0xc000148c00, 0x0)
2024-03-07T10:55:48.374357667Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
2024-03-07T10:55:48.374360438Z stderr F google.golang.org/grpc.(*Server).serveStreams.func1.1()
2024-03-07T10:55:48.374373665Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
2024-03-07T10:55:48.374376854Z stderr F created by google.golang.org/grpc.(*Server).serveStreams.func1
2024-03-07T10:55:48.374379632Z stderr F /go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea
Facing the above error when trying to create a pod with vGPU enabled.
Request from #11
Refer to the GPU share user doc (https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_gpu_sharing.md).
Currently, only specifying GPU shared memory is supported; not being able to specify the number of GPUs is a limitation.
We need to support volcano.sh/gpu-number, just like the NVIDIA GPU plugin, so that a pod's resource request can specify the number of GPUs.
Is it possible to run the volcano device plugin on containerd? If so, how?
We got the message:
Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to nvidia?
This issue is an extension of #18
What happened:
Applying volcano-device-plugin on a server with 8*V100 GPUs, we get volcano.sh/gpu-memory: 0 when describing nodes.
The same situation did not occur when using T4 or P4.
Tracing the kubelet logs, we found the following error message; it seems the sync message is too large.
What caused this bug:
volcano-device-plugin mocks GPUs into a device list (every device in this list is treated as a 1MB memory block) so that different workloads can share one GPU through the Kubernetes device plugin mechanism. When a large-memory GPU such as a V100 is used, the size of the device list exceeds the message bound, and ListAndWatch fails as a result.
Solutions:
The key is to shrink the device list, so we can treat each device as a 10MB memory block and reform the whole bookkeeping process according to this assumption. This granularity is enough for almost all production environments.
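The effect on the list size can be sketched as follows; `mockDeviceCount` and `blockMiB` are illustrative names, not the plugin's actual API:

```go
package main

import "fmt"

// mockDeviceCount returns how many placeholder devices the plugin would
// register for one GPU, given its memory in MiB and the size of each
// bookkeeping block in MiB.
func mockDeviceCount(gpuMemoryMiB, blockMiB uint) uint {
	return gpuMemoryMiB / blockMiB
}

func main() {
	// A 32768 MiB V100 with 1 MiB blocks produces a huge device list;
	// 10 MiB blocks shrink it by an order of magnitude.
	fmt.Println(mockDeviceCount(32768, 1))  // 32768 devices
	fmt.Println(mockDeviceCount(32768, 10)) // 3276 devices
}
```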
When volcano.sh/vgpu-memory is "12288" and volcano.sh/vgpu-number is "2", will the requested 12288MiB of GPU memory be evenly allocated across the two cards?
Running the command below schedules 2 pods at the same time, one allocated 24576M of GPU memory and one 600M. After the pods came up, I entered the containers and ran nvidia-smi, and found that the two amounts are swapped: container ubuntu-container-24576 was given 600M of GPU memory, and container ubuntu-container-600 was given 24576M.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod-1v-24576-1
spec:
schedulerName: volcano
containers:
- name: ubuntu-container-24576
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
volcano.sh/vgpu-number: 1 # requesting 1 vGPUs
volcano.sh/vgpu-memory: 24576
---
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod-1v-600-1
spec:
schedulerName: volcano
containers:
- name: ubuntu-container-600
image: ubuntu:18.04
command: ["bash", "-c", "sleep 86400"]
resources:
limits:
volcano.sh/vgpu-number: 1 # requesting 1 vGPUs
volcano.sh/vgpu-memory: 600
EOF
This may be a bug.
The yaml is:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: mindx-dls-gpu
namespace: vcjob
spec:
minAvailable: 1
schedulerName: volcano
policies:
- event: PodEvicted
action: RestartJob
maxRetry: 3
queue: default
tasks:
- name: "default-1p"
replicas: 1
template:
metadata:
labels:
app: tf
spec:
containers:
- image: nvidia-train:v1
imagePullPolicy: IfNotPresent
name: cuda-container
command:
- "/bin/bash"
- "-c"
#- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
args: [ "while true; do sleep 3000000; done;" ]
resources:
requests:
volcano.sh/gpu-number: 1
limits:
volcano.sh/gpu-number: 1
volumeMounts:
- name: timezone
mountPath: /etc/timezone
- name: localtime
mountPath: /etc/localtime
nodeSelector:
accelerator: nvidia-tesla-v100
volumes:
- name: timezone
hostPath:
path: /etc/timezone
- name: localtime
hostPath:
path: /etc/localtime
restartPolicy: OnFailure
And when I use the "volcano.sh/gpu-memory" resource, there is an error:
Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895 26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941 26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.
env:
volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04 amd64
So, is volcano.sh/gpu-memory supported?
For now, "volcano.sh/gpu-memory" can essentially only set 3 environment variables (based on https://github.com/volcano-sh/devices/blob/master/pkg/plugin/nvidia/server.go#L324) to tell the user how much GPU memory they can use in the current container?
For example, when "volcano.sh/gpu-memory: 1024" is set, the environment variables inside the container look like those below; in fact, a process in this container can use up to 4096M of GPU memory, even though VOLCANO_GPU_ALLOCATED is 1024?
NVIDIA_VISIBLE_DEVICES: "0"
VOLCANO_GPU_ALLOCATED: "1024"
VOLCANO_GPU_TOTAL: "4096"
Can anyone help me confirm whether I understand this correctly?
The tag in the containers field:
image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04
can't currently be found on Docker Hub. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?
Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"
as in issue #1181
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Currently, only one GPU device can be allocated, but in many scenarios we want two or more devices on the same server, or a few devices across servers, to be allocated to a pod.
Thank u!
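Once supported, a pod could request multiple devices the same way the NVIDIA plugin allows (illustrative manifest; the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod   # hypothetical name
spec:
  schedulerName: volcano
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base
    resources:
      limits:
        volcano.sh/gpu-number: 2   # two full GPUs on the same node
```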
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: ocr-job
spec:
minAvailable: 1
schedulerName: volcano
queue: default
policies:
- event: PodEvicted
action: RestartJob
tasks:
- replicas: 1
name: ocr
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- image: ai-grpc-ocr:v1.4
name: ocr
resources:
requests:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
limits:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
restartPolicy: Never
- replicas: 1
name: ocr-2
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- image: ai-grpc-ocr:v1.4
name: ocr
resources:
requests:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
limits:
volcano.sh/gpu-number: 1
#nvidia.com/gpu: 1
restartPolicy: Never
log
$ k get no
NAME STATUS ROLES AGE VERSION
10.122.2.14 Ready <none> 42d v1.26.1
10.122.2.26 Ready <none> 154m v1.26.1
10.122.2.37 Ready <none> 44m v1.26.1
$ k get po
NAME READY STATUS RESTARTS AGE
ocr-job-ocr-0 0/1 Pending 0 4m33s
ocr-job-ocr-2-0 0/1 Pending 0 4m33s
volcano-admission-7f76fc8cf4-rcp85 1/1 Running 0 35d
volcano-admission-init-785w6 0/1 Completed 0 35d
volcano-controllers-6875c95bd7-zs49k 1/1 Running 0 35d
volcano-scheduler-6dcf84d54d-gcwxm 0/1 CrashLoopBackOff 9 (80s ago) 58m
$ k get po
NAME READY STATUS RESTARTS AGE
ocr-job-ocr-0 0/1 Pending 0 41s
ocr-job-ocr-2-0 0/1 Pending 0 41s
volcano-admission-7f76fc8cf4-rcp85 1/1 Running 0 35d
volcano-admission-init-785w6 0/1 Completed 0 35d
volcano-controllers-6875c95bd7-zs49k 1/1 Running 0 35d
volcano-scheduler-6dcf84d54d-zg4d2 0/1 CrashLoopBackOff 2 (25s ago) 4m20s
I0816 12:30:34.061174 1 allocate.go:180] There are <3> nodes for Job <volcano-system/ocr-job-11507a57-1b68-46ad-83bf-38e0c2d76f99>
I0816 12:30:34.061251 1 predicate_helper.go:74] Predicates failed for task <volcano-system/ocr-job-ocr-0> on node <10.122.2.14>: task volcano-system/ocr-job-ocr-0 on node 10.122.2.14 fit failed: Insufficient volcano.sh/gpu-number
E0816 12:30:34.061385 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1caa960?, 0x32ac650})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1caa960, 0x32ac650})
/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
I0816 12:30:34.061465 1 statement.go:352] Discarding operations ...
I0816 12:30:34.061494 1 allocate.go:135] Try to allocate resource to Jobs in Queue <default>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x15ab261]
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1caa960, 0x32ac650})
/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
There are 3 nodes: 10.122.2.26 / 10.122.2.37 are GPU machines; 10.122.2.14 is a CPU machine.
Switching to the nvidia.com/gpu resource, scheduling fails directly. The cause is currently unknown.
It was deployed in July, using the latest image.
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
Currently, in the device plugin's Allocate RPC, we need to find the candidate pod according to the container in the request.
If there are multiple GPU containers in one pod, there will obviously be logic problems when finding the candidate pod.
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	var reqCount uint
	for _, req := range reqs.ContainerRequests {
		reqCount += uint(len(req.DevicesIDs))
	}

	responses := pluginapi.AllocateResponse{}
	firstContainerReq := reqs.ContainerRequests[0]
	firstContainerReqDeviceCount := uint(len(firstContainerReq.DevicesIDs))

	availablePods := podSlice{}
	pendingPods, err := m.kubeInteractor.GetPendingPodsOnNode()
	if err != nil {
		return nil, err
	}
	for _, pod := range pendingPods {
		current := pod
		if IsGPURequiredPod(&current) && !IsGPUAssignedPod(&current) && !IsShouldDeletePod(&current) {
			availablePods = append(availablePods, &current)
		}
	}

	sort.Sort(availablePods)

	var candidatePod *v1.Pod
	for _, pod := range availablePods {
		for i, c := range pod.Spec.Containers {
			if !IsGPURequiredContainer(&c) {
				continue
			}
			if GetGPUResourceOfContainer(&pod.Spec.Containers[i]) == firstContainerReqDeviceCount {
				klog.Infof("Got candidate Pod %s(%s), the device count is: %d", pod.UID, c.Name, firstContainerReqDeviceCount)
				candidatePod = pod
				goto Allocate
			}
		}
	}
....
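One way to make the candidate search robust to pods with several GPU containers is to match the full multiset of per-container device counts instead of only the first request. A simplified sketch; `matchesPod` and its plain `[]int` inputs are stand-ins for the real `ContainerRequests`/pod types:

```go
package main

import (
	"fmt"
	"sort"
)

// matchesPod reports whether the per-container device counts of one
// Allocate call (requestCounts) correspond, in any order, to the GPU
// counts of a pending pod's GPU containers (podCounts).
func matchesPod(requestCounts, podCounts []int) bool {
	if len(requestCounts) != len(podCounts) {
		return false
	}
	a := append([]int(nil), requestCounts...)
	b := append([]int(nil), podCounts...)
	sort.Ints(a)
	sort.Ints(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	// A pod whose two GPU containers ask for 2 and 1 devices matches the
	// Allocate call [2 1] regardless of container order, but a pod asking
	// for [1 1] does not.
	fmt.Println(matchesPod([]int{2, 1}, []int{1, 2})) // true
	fmt.Println(matchesPod([]int{2, 1}, []int{1, 1})) // false
}
```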
When a pod with the volcano resource is run, it crashes
https://github.com/volcano-sh/devices/blob/master/volcano-vgpu-device-plugin.yml
2023/08/14 12:05:25 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:25 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
(these four lines repeat continuously)
When I attempted to install the plugin, the pod gave the following error:
flag provided but not defined: -gpu-strategy
Usage of volcano-device-plugin:
I tried modifying the args in volcano-device-plugin.yaml from ["--gpu-strategy=share", "--gpu-memory-factor=1"] to ["---gpu-strategy=share", "---gpu-memory-factor=1"]. However, the error message then became:
flag provided but not defined: ----gpu-strategy
Usage of volcano-device-plugin:
In the end, I had to remove the args to successfully create the pod.
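For what it's worth, Go's flag package treats one leading dash and two leading dashes identically, so adding more dashes cannot be the fix; a third dash is simply malformed syntax. The original "flag provided but not defined" error more likely means the binary in the image does not register -gpu-strategy at all (e.g. an older plugin image). A minimal sketch of the dash behavior; `parseStrategy` is a hypothetical helper, not the plugin's code:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseStrategy parses a -gpu-strategy flag from args, returning the
// chosen value or a parse error.
func parseStrategy(args []string) (string, error) {
	fs := flag.NewFlagSet("volcano-device-plugin", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the demo's usage text quiet
	strategy := fs.String("gpu-strategy", "number", "share or number")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *strategy, nil
}

func main() {
	// One dash and two dashes are interchangeable in Go's flag package.
	fmt.Println(parseStrategy([]string{"--gpu-strategy=share"}))
	fmt.Println(parseStrategy([]string{"-gpu-strategy=share"}))
	// A third dash is rejected as bad flag syntax, not as an unknown flag.
	fmt.Println(parseStrategy([]string{"---gpu-strategy=share"}))
}
```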
I0615 08:06:10.942719 1 plugin.go:382] Allocate Response [&ContainerAllocateResponse{Envs:map[string]string{CUDA_DEVICE_MEMORY_LIMIT_0: 1024m,CUDA_DEVICE_MEMORY_SHARED_CACHE: /tmp/vgpu/6b5a834e-6fec-47d6-b629-0468cd18ba69.cache,NVIDIA_VISIBLE_DEVICES: GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4,},Mounts:[]*Mount{&Mount{ContainerPath:/usr/local/vgpu/libvgpu.so,HostPath:/usr/local/vgpu/libvgpu.so,ReadOnly:true,},&Mount{ContainerPath:/etc/ld.so.preload,HostPath:/usr/local/vgpu/ld.so.preload,ReadOnly:true,},&Mount{ContainerPath:/tmp/vgpu,HostPath:/tmp/vgpu/containers/1a7defed-9fff-4feb-8921-45cc7ea253f7_vgpu2,ReadOnly:false,},&Mount{ContainerPath:/tmp/vgpulock,HostPath:/tmp/vgpulock,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},}]
I0615 08:06:11.002096 1 util.go:229] TrySuccess:
I0615 08:06:11.002123 1 util.go:235] AllDevicesAllocateSuccess releasing lock
I0615 08:06:11.338349 1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4-5],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]
goroutine 68 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc00038ec80, {0x14cfee0, 0xc0003ee510}, 0xc0005eca00)
/go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc00038ec80}, {0x14cfee0, 0xc0003ee510}, 0xc00062e060, 0x0)
/go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0xc000367aa0, 0x1ce57f8, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea
This may be a bug.
When I request 1 GPU from k8s (the yaml in #10), I find that volcano-deviceplugin gives all of the node's GPUs to my pod.
The picture is as follows (in /dev/):
I don't know why.
k8s:1.17.3
volcano:v1.0.1
volcano-deviceplugin:1.0.0
docker:18.06.3-ce
os : ubuntu18.04
arch: x86
NVIDIA Tesla V100 * 8
cuda10.1 driver
volcano-device-plugin
It can get volcano.sh/gpu-number: 8,
but it cannot get the GPU memory: volcano.sh/gpu-memory is 0.
What is the reason for this?
What version of cuda does vgpu support?