devices's Issues

create pod UnexpectedAdmissionError

For k8s v1.17.9, pod creation fails with UnexpectedAdmissionError due to the lack of the pod update verb.

pod describe

  Warning  UnexpectedAdmissionError  10s   kubelet, amax-pcl  Update plugin resources failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected.

The RBAC rules need the update verb; currently they only grant:

  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

The Docker image tag in volcano-device-plugin.yaml is incorrect, and it only has an x86 version; there is no arm64 version available.

The tag in the containers field:

  • image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04

can't be found on Docker Hub currently. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?

Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"

add 2 or more allocatable devices to a pod

As in issue #1181.
Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind feature
Currently, only one GPU device will be allocated, but in most scenarios we hope that two or more devices on the same server, or a few devices across servers, can be allocated to a pod (an illustrative request is sketched below).
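
For illustration, the hoped-for usage would be to request more than one device with the existing resource name (assuming multi-device allocation were supported):

  resources:
    limits:
      volcano.sh/gpu-number: 2   # two GPUs for one container on the same node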
Thank you!

device-plugin panic

I0615 08:06:10.942719       1 plugin.go:382] Allocate Response [&ContainerAllocateResponse{Envs:map[string]string{CUDA_DEVICE_MEMORY_LIMIT_0: 1024m,CUDA_DEVICE_MEMORY_SHARED_CACHE: /tmp/vgpu/6b5a834e-6fec-47d6-b629-0468cd18ba69.cache,NVIDIA_VISIBLE_DEVICES: GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4,},Mounts:[]*Mount{&Mount{ContainerPath:/usr/local/vgpu/libvgpu.so,HostPath:/usr/local/vgpu/libvgpu.so,ReadOnly:true,},&Mount{ContainerPath:/etc/ld.so.preload,HostPath:/usr/local/vgpu/ld.so.preload,ReadOnly:true,},&Mount{ContainerPath:/tmp/vgpu,HostPath:/tmp/vgpu/containers/1a7defed-9fff-4feb-8921-45cc7ea253f7_vgpu2,ReadOnly:false,},&Mount{ContainerPath:/tmp/vgpulock,HostPath:/tmp/vgpulock,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},}]
I0615 08:06:11.002096       1 util.go:229] TrySuccess:
I0615 08:06:11.002123       1 util.go:235] AllDevicesAllocateSuccess releasing lock
I0615 08:06:11.338349       1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4-5],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]

goroutine 68 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc00038ec80, {0x14cfee0, 0xc0003ee510}, 0xc0005eca00)
	/go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc00038ec80}, {0x14cfee0, 0xc0003ee510}, 0xc00062e060, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0xc000367aa0, 0x1ce57f8, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea

Improve the logic of finding candidate pod in Allocate RPC

Currently, in the device plugin Allocate RPC, we need to find the candidate pod according to the container in the request.
If there are multiple GPU containers in one pod, matching against only the first container request can obviously pick the wrong candidate pod.

func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	var reqCount uint
	for _, req := range reqs.ContainerRequests {
		reqCount += uint(len(req.DevicesIDs))
	}

	responses := pluginapi.AllocateResponse{}

	firstContainerReq := reqs.ContainerRequests[0]
	firstContainerReqDeviceCount := uint(len(firstContainerReq.DevicesIDs))

	availablePods := podSlice{}
	pendingPods, err := m.kubeInteractor.GetPendingPodsOnNode()
	if err != nil {
		return nil, err
	}
	for _, pod := range pendingPods {
		current := pod
		if IsGPURequiredPod(&current) && !IsGPUAssignedPod(&current) && !IsShouldDeletePod(&current) {
			availablePods = append(availablePods, &current)
		}
	}

	sort.Sort(availablePods)

	var candidatePod *v1.Pod
	for _, pod := range availablePods {
		for i, c := range pod.Spec.Containers {
			if !IsGPURequiredContainer(&c) {
				continue
			}

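			// Note: only the first container request's device count is compared here,
			// which can select the wrong candidate pod when a pod has several GPU containers.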
			if GetGPUResourceOfContainer(&pod.Spec.Containers[i]) == firstContainerReqDeviceCount {
				klog.Infof("Got candidate Pod %s(%s), the device count is: %d", pod.UID, c.Name, firstContainerReqDeviceCount)
				candidatePod = pod
				goto Allocate
			}
		}
	}

        ....
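
One possible direction (a rough sketch reusing the helpers above, not the project's actual fix): compare the pod's total GPU request across all of its GPU containers with the total device count of the whole request (the reqCount computed above but never used), instead of matching only the first container request.

func totalGPURequestOfPod(pod *v1.Pod) uint {
	var total uint
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if !IsGPURequiredContainer(c) {
			continue
		}
		total += GetGPUResourceOfContainer(c)
	}
	return total
}

// A candidate pod would then be one where totalGPURequestOfPod(pod) == reqCount,
// which stays unambiguous even when the pod has several GPU containers.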

Move related GPU code to pkg/gpu

As we're going to support different devices, we'd prefer to isolate the code by directory and share the common utilities in a package, e.g. pkg/common.

Using gpu-number causes the scheduler to crash

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ocr-job 
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default 
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: ocr
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - image: ai-grpc-ocr:v1.4 
              name: ocr
              resources:
                requests:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
                limits:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
          restartPolicy: Never
    - replicas: 1
      name: ocr-2
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - image: ai-grpc-ocr:v1.4
              name: ocr
              resources:
                requests:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
                limits:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
          restartPolicy: Never

log

$ k get no 
NAME          STATUS   ROLES    AGE    VERSION
10.122.2.14   Ready    <none>   42d    v1.26.1
10.122.2.26   Ready    <none>   154m   v1.26.1
10.122.2.37   Ready    <none>   44m    v1.26.1
$ k get po                                      
NAME                                   READY   STATUS             RESTARTS      AGE
ocr-job-ocr-0                          0/1     Pending            0             4m33s
ocr-job-ocr-2-0                        0/1     Pending            0             4m33s
volcano-admission-7f76fc8cf4-rcp85     1/1     Running            0             35d
volcano-admission-init-785w6           0/1     Completed          0             35d
volcano-controllers-6875c95bd7-zs49k   1/1     Running            0             35d
volcano-scheduler-6dcf84d54d-gcwxm     0/1     CrashLoopBackOff   9 (80s ago)   58m

$ k get po       
NAME                                   READY   STATUS             RESTARTS      AGE
ocr-job-ocr-0                          0/1     Pending            0             41s
ocr-job-ocr-2-0                        0/1     Pending            0             41s
volcano-admission-7f76fc8cf4-rcp85     1/1     Running            0             35d
volcano-admission-init-785w6           0/1     Completed          0             35d
volcano-controllers-6875c95bd7-zs49k   1/1     Running            0             35d
volcano-scheduler-6dcf84d54d-zg4d2     0/1     CrashLoopBackOff   2 (25s ago)   4m20s
 I0816 12:30:34.061174       1 allocate.go:180] There are <3> nodes for Job <volcano-system/ocr-job-11507a57-1b68-46ad-83bf-38e0c2d76f99>
I0816 12:30:34.061251       1 predicate_helper.go:74] Predicates failed for task <volcano-system/ocr-job-ocr-0> on node <10.122.2.14>: task volcano-system/ocr-job-ocr-0 on node 10.122.2.14 fit failed: Insufficient volcano.sh/gpu-number
E0816 12:30:34.061385       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1caa960?, 0x32ac650})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1caa960, 0x32ac650})
	/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
I0816 12:30:34.061465       1 statement.go:352] Discarding operations ...
I0816 12:30:34.061494       1 allocate.go:135] Try to allocate resource to Jobs in Queue <default>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x15ab261]

goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1caa960, 0x32ac650})
	/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
..

There are 3 nodes: 10.122.2.26 and 10.122.2.37 are GPU machines; 10.122.2.14 is a CPU machine.
When switching to the nvidia.com/gpu resource, scheduling fails directly; the cause is currently unknown.

It was deployed in July, using the latest image.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

When using volcano.sh/gpu-number: 1, why does the pod get all GPUs?

Maybe a bug

When I request 1 GPU from k8s (the YAML in #10), I find that volcano-deviceplugin gives all of the node's GPUs to my pod.

The screenshot of /dev/ inside the pod follows (omitted here); it shows the node's GPU devices.

I do not know why.

k8s:1.17.3
volcano:v1.0.1
volcano-deviceplugin:1.0.0
docker:18.06.3-ce
os : ubuntu18.04
arch: x86

The `args` configuration for the `containers` in the `volcano-device-plugin.yaml` is incorrect.

When I attempted to install the plugin, the pod gave the following error:

flag provided but not defined: -gpu-strategy
Usage of volcano-device-plugin:

I tried modifying the args in volcano-device-plugin.yaml from ["--gpu-strategy=share", "--gpu-memory-factor=1"] to ["---gpu-strategy=share", "---gpu-memory-factor=1"]. However, the error message then became:

flag provided but not defined: ----gpu-strategy
Usage of volcano-device-plugin:

In the end, I had to remove the args to successfully create the pod.

Is gpu-memory just a GPU memory claim?

For now, "volcano.sh/gpu-memory" can essentially only set 3 environment variables (based on https://github.com/volcano-sh/devices/blob/master/pkg/plugin/nvidia/server.go#L324) to tell the user how much GPU memory they can use in the current container?

For example, when setting "volcano.sh/gpu-memory: 1024", the environment variables in the container look like the ones below; in fact, the process in this container can use up to 4096 MB of GPU memory even though VOLCANO_GPU_ALLOCATED is 1024?

NVIDIA_VISIBLE_DEVICES: "0"
VOLCANO_GPU_ALLOCATED: "1024"
VOLCANO_GPU_TOTAL: "4096"

Can anyone help me confirm whether I understand this correctly?
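
If that is right, the limit is only advisory: the process inside the container has to read the variables and stay within them itself. A minimal, illustrative sketch of reading them (not part of the plugin):

package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Variables injected by the device plugin, per the example above.
	allocated, _ := strconv.Atoi(os.Getenv("VOLCANO_GPU_ALLOCATED")) // e.g. "1024" (MB)
	total, _ := strconv.Atoi(os.Getenv("VOLCANO_GPU_TOTAL"))         // e.g. "4096" (MB)
	fmt.Printf("may use %d of %d MB on GPU(s) %s\n",
		allocated, total, os.Getenv("NVIDIA_VISIBLE_DEVICES"))
	// Nothing here enforces the limit; the application must cap its own GPU
	// memory usage at the allocated amount.
}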

ListAndWatch failed when managing large-memory GPUs such as NVIDIA Tesla V100

This issue is an extension of #18

What happened:
Applying volcano-device-plugin on a server with 8 V100 GPUs, but describing the nodes shows volcano.sh/gpu-memory: 0 (screenshot omitted).
The same situation did not occur when using T4 or P4 GPUs.
Tracing the kubelet logs, I found the following error message (screenshot omitted); it seems the sync message is too large.

What caused this bug:
volcano-device-plugin mocks each GPU as a device list (every device in this list is treated as a 1 MB memory block), so that different workloads can share one GPU through the Kubernetes device plugin mechanism. With a large-memory GPU such as the V100, the size of the device list exceeds the message bound, and ListAndWatch fails as a result.

Solutions:
The key is to minimize the size of the device list, so we can treat each device as a 10 MB memory block and rework the whole bookkeeping process according to this assumption. This granularity is enough for almost all production environments.
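
A minimal sketch of that bookkeeping change (buildDeviceList and the block size are illustrative, not the plugin's actual code):

package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildDeviceList mocks one physical GPU as a list of fake devices, each
// standing for blockSizeMB of GPU memory. Raising the block size from 1 MB
// to 10 MB shrinks the advertised list (and the ListAndWatch message) ~10x.
func buildDeviceList(gpuUUID string, totalMemMB, blockSizeMB uint64) []*pluginapi.Device {
	n := totalMemMB / blockSizeMB
	devices := make([]*pluginapi.Device, 0, n)
	for i := uint64(0); i < n; i++ {
		devices = append(devices, &pluginapi.Device{
			ID:     fmt.Sprintf("%s-block-%d", gpuUUID, i),
			Health: pluginapi.Healthy,
		})
	}
	return devices
}

func main() {
	// A 32 GB V100 advertised in 10 MB blocks yields ~3276 fake devices
	// instead of ~32768 with 1 MB blocks.
	fmt.Println(len(buildDeviceList("GPU-example", 32768, 10)))
}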

master code can't run

2023/08/14 12:05:25 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:25 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:27 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:27 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:27 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:27 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start

Warning UnexpectedAdmissionError

Running the pod reports the error: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

Can I compile the volcano device plugin source code for the ARM architecture? Is there a plan to provide an ARM build?

The tag in the containers field:

image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04
can't be found on Docker Hub currently. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?

Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"

rpc error: code = Unknown desc = failed to find gpu id

Maybe it is a bug.

The YAML is:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-gpu
  namespace: vcjob
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-1p"
    replicas: 1
    template:
      metadata:
        labels:
          app: tf
      spec:
        containers:
        - image: nvidia-train:v1
          imagePullPolicy: IfNotPresent
          name: cuda-container
          command:
          - "/bin/bash"
          - "-c"
          #- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
          args: [ "while true; do sleep 3000000; done;"  ]
          resources:
            requests:
              volcano.sh/gpu-number: 1
            limits:
              volcano.sh/gpu-number: 1
          volumeMounts:
          - name: timezone
            mountPath: /etc/timezone
          - name: localtime
            mountPath: /etc/localtime
        nodeSelector:
          accelerator: nvidia-tesla-v100
        volumes:
        - name: timezone
          hostPath:
            path: /etc/timezone
        - name: localtime
          hostPath:
            path: /etc/timezone
        restartPolicy: OnFailure

And when I use the "volcano.sh/gpu-memory" resource, the error is:

Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895   26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941   26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

env:

volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04  amd64

So, is volcano.sh/gpu-memory supported?

volcano.sh/gpu-memory: 0

NVIDIA Tesla V100 * 8
CUDA 10.1 driver
volcano-device-plugin
volcano.sh/gpu-number: 8 is reported correctly,
but the GPU memory is not: volcano.sh/gpu-memory: 0.
What is the reason for this?
