devices's Issues

create pod UnexpectedAdmissionError

For k8s v1.17.9, pod creation fails with UnexpectedAdmissionError due to the lack of the pod update verb.

pod describe

  Warning  UnexpectedAdmissionError  10s   kubelet, amax-pcl  Update plugin resources failed due to rpc error: code = Unknown desc = failed to update pod annotation pods "pod1" is forbidden: User "system:serviceaccount:kube-system:volcano-device-plugin" cannot update resource "pods" in API group "" in the namespace "default", which is unexpected.

The RBAC rules need the update verb; currently they only grant:

  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]

The Docker image tag in volcano-device-plugin.yaml is incorrect, and it only has an x86 version; there is no arm64 version available.

The tag in the containers field:

  • image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04

can't be found on Docker Hub currently. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?

Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"

add 2 or more allocatable devices to a pod

As in issue #1181.
Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind feature
Currently, only one GPU device will be allocated, but in most scenarios we hope that two or more devices on the same server, or a few devices across servers, can be allocated to a pod (an illustrative request is sketched below).
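
For illustration, the hoped-for usage would be to request more than one device with the existing resource name (assuming multi-device allocation were supported):

  resources:
    limits:
      volcano.sh/gpu-number: 2   # two GPUs for one container on the same node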
Thank you!

device-plugin panic

I0615 08:06:10.942719       1 plugin.go:382] Allocate Response [&ContainerAllocateResponse{Envs:map[string]string{CUDA_DEVICE_MEMORY_LIMIT_0: 1024m,CUDA_DEVICE_MEMORY_SHARED_CACHE: /tmp/vgpu/6b5a834e-6fec-47d6-b629-0468cd18ba69.cache,NVIDIA_VISIBLE_DEVICES: GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4,},Mounts:[]*Mount{&Mount{ContainerPath:/usr/local/vgpu/libvgpu.so,HostPath:/usr/local/vgpu/libvgpu.so,ReadOnly:true,},&Mount{ContainerPath:/etc/ld.so.preload,HostPath:/usr/local/vgpu/ld.so.preload,ReadOnly:true,},&Mount{ContainerPath:/tmp/vgpu,HostPath:/tmp/vgpu/containers/1a7defed-9fff-4feb-8921-45cc7ea253f7_vgpu2,ReadOnly:false,},&Mount{ContainerPath:/tmp/vgpulock,HostPath:/tmp/vgpulock,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},}]
I0615 08:06:11.002096       1 util.go:229] TrySuccess:
I0615 08:06:11.002123       1 util.go:235] AllDevicesAllocateSuccess releasing lock
I0615 08:06:11.338349       1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4-5],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]

goroutine 68 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc00038ec80, {0x14cfee0, 0xc0003ee510}, 0xc0005eca00)
	/go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc00038ec80}, {0x14cfee0, 0xc0003ee510}, 0xc00062e060, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0xc000367aa0, 0x1ce57f8, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/[email protected]/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/[email protected]/server.go:744 +0xea

Improve the logic of finding candidate pod in Allocate RPC

Currently, in the device plugin Allocate RPC, we need to find the candidate pod according to the container in the request.
If there are multiple GPU containers in one pod, matching against only the first container request can obviously pick the wrong candidate pod.

func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	var reqCount uint
	for _, req := range reqs.ContainerRequests {
		reqCount += uint(len(req.DevicesIDs))
	}

	responses := pluginapi.AllocateResponse{}

	firstContainerReq := reqs.ContainerRequests[0]
	firstContainerReqDeviceCount := uint(len(firstContainerReq.DevicesIDs))

	availablePods := podSlice{}
	pendingPods, err := m.kubeInteractor.GetPendingPodsOnNode()
	if err != nil {
		return nil, err
	}
	for _, pod := range pendingPods {
		current := pod
		if IsGPURequiredPod(&current) && !IsGPUAssignedPod(&current) && !IsShouldDeletePod(&current) {
			availablePods = append(availablePods, &current)
		}
	}

	sort.Sort(availablePods)

	var candidatePod *v1.Pod
	for _, pod := range availablePods {
		for i, c := range pod.Spec.Containers {
			if !IsGPURequiredContainer(&c) {
				continue
			}

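			// Note: only the first container request's device count is compared here,
			// which can select the wrong candidate pod when a pod has several GPU containers.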
			if GetGPUResourceOfContainer(&pod.Spec.Containers[i]) == firstContainerReqDeviceCount {
				klog.Infof("Got candidate Pod %s(%s), the device count is: %d", pod.UID, c.Name, firstContainerReqDeviceCount)
				candidatePod = pod
				goto Allocate
			}
		}
	}

        ....
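
One possible direction (a rough sketch reusing the helpers above, not the project's actual fix): compare the pod's total GPU request across all of its GPU containers with the total device count of the whole request (the reqCount computed above but never used), instead of matching only the first container request.

func totalGPURequestOfPod(pod *v1.Pod) uint {
	var total uint
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if !IsGPURequiredContainer(c) {
			continue
		}
		total += GetGPUResourceOfContainer(c)
	}
	return total
}

// A candidate pod would then be one where totalGPURequestOfPod(pod) == reqCount,
// which stays unambiguous even when the pod has several GPU containers.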

Move related GPU code to pkg/gpu

As we're going to support different devices, we'd prefer to isolate the code by directory and share the common utilities in a package, e.g. pkg/common.

Using gpu-number causes the scheduler to crash

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ocr-job 
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default 
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: ocr
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - image: ai-grpc-ocr:v1.4 
              name: ocr
              resources:
                requests:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
                limits:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
          restartPolicy: Never
    - replicas: 1
      name: ocr-2
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          containers:
            - image: ai-grpc-ocr:v1.4
              name: ocr
              resources:
                requests:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
                limits:
                  volcano.sh/gpu-number: 1
                  #nvidia.com/gpu: 1 
          restartPolicy: Never

log

$ k get no 
NAME          STATUS   ROLES    AGE    VERSION
10.122.2.14   Ready    <none>   42d    v1.26.1
10.122.2.26   Ready    <none>   154m   v1.26.1
10.122.2.37   Ready    <none>   44m    v1.26.1
$ k get po                                      
NAME                                   READY   STATUS             RESTARTS      AGE
ocr-job-ocr-0                          0/1     Pending            0             4m33s
ocr-job-ocr-2-0                        0/1     Pending            0             4m33s
volcano-admission-7f76fc8cf4-rcp85     1/1     Running            0             35d
volcano-admission-init-785w6           0/1     Completed          0             35d
volcano-controllers-6875c95bd7-zs49k   1/1     Running            0             35d
volcano-scheduler-6dcf84d54d-gcwxm     0/1     CrashLoopBackOff   9 (80s ago)   58m

$ k get po       
NAME                                   READY   STATUS             RESTARTS      AGE
ocr-job-ocr-0                          0/1     Pending            0             41s
ocr-job-ocr-2-0                        0/1     Pending            0             41s
volcano-admission-7f76fc8cf4-rcp85     1/1     Running            0             35d
volcano-admission-init-785w6           0/1     Completed          0             35d
volcano-controllers-6875c95bd7-zs49k   1/1     Running            0             35d
volcano-scheduler-6dcf84d54d-zg4d2     0/1     CrashLoopBackOff   2 (25s ago)   4m20s
 I0816 12:30:34.061174       1 allocate.go:180] There are <3> nodes for Job <volcano-system/ocr-job-11507a57-1b68-46ad-83bf-38e0c2d76f99>
I0816 12:30:34.061251       1 predicate_helper.go:74] Predicates failed for task <volcano-system/ocr-job-ocr-0> on node <10.122.2.14>: task volcano-system/ocr-job-ocr-0 on node 10.122.2.14 fit failed: Insufficient volcano.sh/gpu-number
E0816 12:30:34.061385       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1caa960?, 0x32ac650})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1caa960, 0x32ac650})
	/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
I0816 12:30:34.061465       1 statement.go:352] Discarding operations ...
I0816 12:30:34.061494       1 allocate.go:135] Try to allocate resource to Jobs in Queue <default>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x15ab261]

goroutine 334 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00102cf70?})
	/go/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x1caa960, 0x32ac650})
	/usr/local/go/src/runtime/panic.go:884 +0x212
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.getDevicesIdleGPUs(...)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:64
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.predicateGPUbyNumber(0xc000dadac0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:166 +0x41
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.checkNodeGPUNumberPredicate(0xc000c68cf0?, 0x0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/share.go:140 +0x3f
volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare.(*GPUDevices).FilterNode(0x1c8bec0?, 0xc000dadac0)
	/go/src/volcano.sh/volcano/pkg/scheduler/api/devices/nvidia/gpushare/device_info.go:161 +0x157
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:522 +0x16e4
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PredicateFn(0xc001094000, 0xc00100df80?, 0x0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:615 +0x1ce
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute.func1(0xc000848be0, 0xc0004e0180)
	/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:106 +0x1cb
volcano.sh/volcano/pkg/scheduler/util.(*predicateHelper).PredicateNodes.func1(0xc0002045a0?)
	/go/src/volcano.sh/volcano/pkg/scheduler/util/predicate_helper.go:73 +0x3a2
k8s.io/client-go/util/workqueue.ParallelizeUntil.func1()
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:90 +0x106
created by k8s.io/client-go/util/workqueue.ParallelizeUntil
	/go/src/volcano.sh/volcano/vendor/k8s.io/client-go/util/workqueue/parallelizer.go:76 +0x1d7
..

There are 3 nodes: 10.122.2.26 and 10.122.2.37 are GPU machines; 10.122.2.14 is a CPU machine.
When switching to the nvidia.com/gpu resource, scheduling fails directly; the cause is currently unknown.

It was deployed in July, using the latest image.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

When using volcano.sh/gpu-number: 1, why does the pod get all GPUs?

Maybe a bug

When I request 1 GPU from k8s (the YAML in #10), I find that volcano-deviceplugin gives all of the node's GPUs to my pod.

The screenshot of /dev/ inside the pod follows (omitted here); it shows the node's GPU devices.

I do not know why.

k8s:1.17.3
volcano:v1.0.1
volcano-deviceplugin:1.0.0
docker:18.06.3-ce
os : ubuntu18.04
arch: x86

The `args` configuration for the `containers` in the `volcano-device-plugin.yaml` is incorrect.

When I attempted to install the plugin, the pod gave the following error:

flag provided but not defined: -gpu-strategy
Usage of volcano-device-plugin:

I tried modifying the args in volcano-device-plugin.yaml from ["--gpu-strategy=share", "--gpu-memory-factor=1"] to ["---gpu-strategy=share", "---gpu-memory-factor=1"]. However, the error message then became:

flag provided but not defined: ----gpu-strategy
Usage of volcano-device-plugin:

In the end, I had to remove the args to successfully create the pod.

Is gpu-memory just a GPU memory claim?

For now, "volcano.sh/gpu-memory" can essentially only set 3 environment variables (based on https://github.com/volcano-sh/devices/blob/master/pkg/plugin/nvidia/server.go#L324) to tell the user how much GPU memory they can use in the current container?

For example, when setting "volcano.sh/gpu-memory: 1024", the environment variables in the container look like the ones below; in fact, the process in this container can use up to 4096 MB of GPU memory even though VOLCANO_GPU_ALLOCATED is 1024?

NVIDIA_VISIBLE_DEVICES: "0"
VOLCANO_GPU_ALLOCATED: "1024"
VOLCANO_GPU_TOTAL: "4096"

Can anyone help me confirm whether I understand this correctly?
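
If that is right, the limit is only advisory: the process inside the container has to read the variables and stay within them itself. A minimal, illustrative sketch of reading them (not part of the plugin):

package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Variables injected by the device plugin, per the example above.
	allocated, _ := strconv.Atoi(os.Getenv("VOLCANO_GPU_ALLOCATED")) // e.g. "1024" (MB)
	total, _ := strconv.Atoi(os.Getenv("VOLCANO_GPU_TOTAL"))         // e.g. "4096" (MB)
	fmt.Printf("may use %d of %d MB on GPU(s) %s\n",
		allocated, total, os.Getenv("NVIDIA_VISIBLE_DEVICES"))
	// Nothing here enforces the limit; the application must cap its own GPU
	// memory usage at the allocated amount.
}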

ListAndWatch failed when managing large-memory GPUs such as NVIDIA Tesla V100

This issue is an extension of #18

What happened:
Applying volcano-device-plugin on a server with 8 V100 GPUs, but describing the nodes shows volcano.sh/gpu-memory: 0 (screenshot omitted).
The same situation did not occur when using T4 or P4 GPUs.
Tracing the kubelet logs, I found the following error message (screenshot omitted); it seems the sync message is too large.

What caused this bug:
volcano-device-plugin mocks each GPU as a device list (every device in this list is treated as a 1 MB memory block), so that different workloads can share one GPU through the Kubernetes device plugin mechanism. With a large-memory GPU such as the V100, the size of the device list exceeds the message bound, and ListAndWatch fails as a result.

Solutions:
The key is to minimize the size of the device list, so we can treat each device as a 10 MB memory block and rework the whole bookkeeping process according to this assumption. This granularity is enough for almost all production environments.
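
A minimal sketch of that bookkeeping change (buildDeviceList and the block size are illustrative, not the plugin's actual code):

package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildDeviceList mocks one physical GPU as a list of fake devices, each
// standing for blockSizeMB of GPU memory. Raising the block size from 1 MB
// to 10 MB shrinks the advertised list (and the ListAndWatch message) ~10x.
func buildDeviceList(gpuUUID string, totalMemMB, blockSizeMB uint64) []*pluginapi.Device {
	n := totalMemMB / blockSizeMB
	devices := make([]*pluginapi.Device, 0, n)
	for i := uint64(0); i < n; i++ {
		devices = append(devices, &pluginapi.Device{
			ID:     fmt.Sprintf("%s-block-%d", gpuUUID, i),
			Health: pluginapi.Healthy,
		})
	}
	return devices
}

func main() {
	// A 32 GB V100 advertised in 10 MB blocks yields ~3276 fake devices
	// instead of ~32768 with 1 MB blocks.
	fmt.Println(len(buildDeviceList("GPU-example", 32768, 10)))
}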

master code can't run

2023/08/14 12:05:25 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:25 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:26 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:26 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:26 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start
2023/08/14 12:05:27 Could not start device plugin for 'volcano.sh/gpu-memory': listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:27 Plugin Volcano-GPU-Plugin failed to start: listen unix /var/lib/kubelet/device-plugins/volcano.sock: bind: address already in use
2023/08/14 12:05:27 You can check the prerequisites at: https://github.com/volcano-sh/k8s-device-plugin#prerequisites
2023/08/14 12:05:27 You can learn how to set the runtime at: https://github.com/volcano-sh/k8s-device-plugin#quick-start

Warning UnexpectedAdmissionError

Running the pod reports the error: Warning UnexpectedAdmissionError 79s kubelet Allocate failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

Can I compile the volcano device plugin source code for the ARM architecture? Is there a plan to provide an ARM build?

The tag in the containers field:

image: volcanosh/volcano-device-plugin:1.0.0-ubuntu20.04
can't be found on Docker Hub currently. Is it equivalent to the volcanosh/volcano-device-plugin:latest tag?

Additionally, after actually pulling the latest image, I found it is only compatible with the x86 environment. Pods on ARM architecture machines will report:
standard_init_linux.go:220: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:220: exec user process caused "exec format error"

rpc error: code = Unknown desc = failed to find gpu id

Maybe it is a bug.

The YAML is:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-gpu
  namespace: vcjob
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-1p"
    replicas: 1
    template:
      metadata:
        labels:
          app: tf
      spec:
        containers:
        - image: nvidia-train:v1
          imagePullPolicy: IfNotPresent
          name: cuda-container
          command:
          - "/bin/bash"
          - "-c"
          #- "chmod 777 -R /job;cd /job/code/ModelZoo_Resnet50_HC; bash train_start.sh"
          args: [ "while true; do sleep 3000000; done;"  ]
          resources:
            requests:
              volcano.sh/gpu-number: 1
            limits:
              volcano.sh/gpu-number: 1
          volumeMounts:
          - name: timezone
            mountPath: /etc/timezone
          - name: localtime
            mountPath: /etc/localtime
        nodeSelector:
          accelerator: nvidia-tesla-v100
        volumes:
        - name: timezone
          hostPath:
            path: /etc/timezone
        - name: localtime
          hostPath:
            path: /etc/timezone
        restartPolicy: OnFailure

And when I use the "volcano.sh/gpu-memory" resource, the error is:

Nov 10 20:11:23 ubuntu560 kubelet[26515]: E1110 20:11:23.149895   26515 manager.go:374] Failed to allocate device plugin resource for pod 28bc8549-e3b9-40f6-8adb-7830f967d97b: rpc error: code = Unknown desc = failed to find gpu id
Nov 10 20:11:23 ubuntu560 kubelet[26515]: W1110 20:11:23.149941   26515 predicate.go:74] Failed to admit pod mindx-dls-gpu-default-1p-0_vcjob(28bc8549-e3b9-40f6-8adb-7830f967d97b) - Update plugin resources failed due to rpc error: code = Unknown desc = failed to find gpu id, which is unexpected.

env:

volcano:v1.0.1
volcano-deviceplugin:v1.0.0
os:ubuntu 18.04  amd64

So, is volcano.sh/gpu-memory supported?

volcano.sh/gpu-memory: 0

NVIDIA Tesla V100 * 8
CUDA 10.1 driver
volcano-device-plugin
volcano.sh/gpu-number: 8 is reported correctly,
but the GPU memory is not: volcano.sh/gpu-memory: 0.
What is the reason for this?
