
gpumounter's Introduction

XinYuan Zhihu

🌱 Things I am currently working on:

  • Docker & Kubernetes & GPU
  • Golang & Python & Java

gpumounter's People

Contributors

ilyee, influencerngzk, pokerfacesad, thinkblue1991


gpumounter's Issues

Slave pod creation will fail if the owner pod's namespace has Resource Quotas enabled

Refer to the K8S documentation

If quota is enabled in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values; otherwise, the quota system may reject pod creation.

So slave pod creation will fail if the owner pod's namespace has resource quotas enabled:

pods "xxx" is forbidden

And if we make the slave pod satisfy the quota so that it can be created in the owner pod's namespace, it will consume that namespace's resource quota, which is unreasonable in a multi-tenant cluster.
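For illustration, here is the kind of quota that triggers the rejection, expressed with client-go types since the project is written in Go. This is a sketch only; the quota name, namespace, and values are illustrative assumptions, not taken from the report.

package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleQuota is an illustrative ResourceQuota of the kind described in this
// issue. Once a quota covering compute resources exists in the owner pod's
// namespace, the apiserver rejects any pod that does not declare cpu/memory
// requests and limits (including GPU Mounter's slave pod) with the
// "pods ... is forbidden" error quoted above.
var exampleQuota = corev1.ResourceQuota{
    ObjectMeta: metav1.ObjectMeta{Name: "compute-quota", Namespace: "owner-namespace"},
    Spec: corev1.ResourceQuotaSpec{
        Hard: corev1.ResourceList{
            corev1.ResourceRequestsCPU:    resource.MustParse("4"),
            corev1.ResourceRequestsMemory: resource.MustParse("8Gi"),
            corev1.ResourceLimitsCPU:      resource.MustParse("8"),
            corev1.ResourceLimitsMemory:   resource.MustParse("16Gi"),
        },
    },
}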

Slave Pod BestEffort QoS may lead to GPU resource leak

In the current version, the slave pod's QoS class is BestEffort.
The slave pod will most likely be killed when an eviction occurs, which leads to a GPU resource leak (the user pod can still use the GPU, but GPU Mounter and kube-scheduler are not aware of it at all).
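One possible mitigation, sketched here under the assumption that the slave pod spec is built with client-go types (the helper name and values are illustrative, not existing project code): give the slave pod identical, non-zero requests and limits so its QoS class becomes Guaranteed instead of BestEffort.

package allocator

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// makeGuaranteed is a hypothetical helper, not GPU Mounter's existing code: it
// gives every container of the slave pod identical, non-zero cpu/memory
// requests and limits. That is what promotes a pod from BestEffort to the
// Guaranteed QoS class, so it is far less likely to be the first pod chosen
// during node-pressure eviction.
func makeGuaranteed(pod *corev1.Pod) {
    resources := corev1.ResourceList{
        corev1.ResourceCPU:    resource.MustParse("10m"),
        corev1.ResourceMemory: resource.MustParse("16Mi"),
    }
    for i := range pod.Spec.Containers {
        pod.Spec.Containers[i].Resources.Requests = resources
        pod.Spec.Containers[i].Resources.Limits = resources
    }
}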

gpu-mounter-worker Error: Can not connect to /var/lib/kubelet/pod-resources/kubelet.sock

  • k8s version: v1.14
  • docker version: 18.09.5

In my test cluster, kubelet.sock is located at /var/lib/kubelet/device-plugins on the host machine.

So I changed the mount path in the gpu-mounter-worker.yaml file:

     volumes:
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        - name: device-monitor
          hostPath:
            type: Directory
            #path: /var/lib/kubelet/pod-resources
            path: /var/lib/kubelet/device-plugins
        - name: log-dir
          hostPath:
            type: DirectoryOrCreate
            path: /etc/GPUMounter/log

The error output is as follows:

[root@t32 deploy]# kubectl logs -f gpu-mounter-workers-2wfnp    -n kube-system
2020-12-20T12:30:27.657Z	INFO	GPUMounter-worker/main.go:15	Service Starting...
2020-12-20T12:30:27.657Z	INFO	gpu-mount/server.go:21	Creating gpu mounter
2020-12-20T12:30:27.657Z	INFO	allocator/allocator.go:26	Creating gpu allocator
2020-12-20T12:30:27.657Z	INFO	collector/collector.go:23	Creating gpu collector
2020-12-20T12:30:27.657Z	INFO	collector/collector.go:41	Start get gpu info
2020-12-20T12:30:27.660Z	INFO	collector/collector.go:52	GPU Num: 2
2020-12-20T12:30:27.674Z	ERROR	collector/collector.go:106	Can not connect to /var/lib/kubelet/pod-resources/kubelet.sock
2020-12-20T12:30:27.674Z	ERROR	collector/collector.go:107	failure getting pod resources rpc error: code = Unimplemented desc = unknown service v1alpha1.PodResourcesLister
2020-12-20T12:30:27.674Z	ERROR	collector/collector.go:32	Failed to update gpu status
2020-12-20T12:30:27.674Z	ERROR	allocator/allocator.go:30	Failed to init gpu collector
2020-12-20T12:30:27.674Z	ERROR	gpu-mount/server.go:25	Filed to init gpu allocator
2020-12-20T12:30:27.674Z	ERROR	GPUMounter-worker/main.go:18	Failed to init gpu mounter
2020-12-20T12:30:27.674Z	ERROR	GPUMounter-worker/main.go:19	failure getting pod resources rpc error: code = Unimplemented desc = unknown service v1alpha1.PodResourcesLister
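For context, a minimal sketch of how a client typically queries the pod-resources endpoint, assuming grpc-go and the v1alpha1 package this project already imports (the function name and timeout are illustrative). Note that /var/lib/kubelet/device-plugins/kubelet.sock is the kubelet's device-plugin registration socket, so pointing GPU Mounter at it would explain the Unimplemented PodResourcesLister error above; on k8s v1.14 the pod-resources endpoint may also require the KubeletPodResources feature gate to be enabled.

package collector

import (
    "context"
    "net"
    "time"

    "google.golang.org/grpc"
    podresourcesapi "k8s.io/kubernetes/pkg/kubelet/apis/podresources/v1alpha1"
)

// listPodResources is a sketch, not GPU Mounter's actual collector code: it
// dials the kubelet pod-resources socket over a unix domain socket and lists
// which devices (e.g. nvidia.com/gpu) are assigned to which pod containers.
func listPodResources(socket string) (*podresourcesapi.ListPodResourcesResponse, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, socket,
        grpc.WithInsecure(),
        grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
            return (&net.Dialer{}).DialContext(ctx, "unix", addr)
        }),
    )
    if err != nil {
        return nil, err
    }
    defer conn.Close()

    // An "Unimplemented ... v1alpha1.PodResourcesLister" error means the
    // server behind the socket does not expose this service at all, which is
    // what happens when a different kubelet socket is mounted in place of
    // /var/lib/kubelet/pod-resources.
    client := podresourcesapi.NewPodResourcesListerClient(conn)
    return client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
}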

More graceful dependency management

GPU Mounter depends on the KubeletPodResources API to get GPU usage from the kubelet.

podresourcesapi "k8s.io/kubernetes/pkg/kubelet/apis/podresources/v1alpha1"

The KubeletPodResources API is imported from k8s.io/kubernetes directly.
As discussed in kubernetes/issues/79384 and go/issues/32776, it is necessary to add replace directives for matching versions of all of the subcomponents.

GPUMounter/go.mod

Lines 14 to 39 in f827c71

k8s.io/kubernetes v1.16.0
)
replace (
k8s.io/api => k8s.io/api v0.0.0-20190918155943-95b840bb6a1f
k8s.io/apiextensions-apiserver => k8s.io/apiextensions-apiserver v0.0.0-20190918161926-8f644eb6e783
k8s.io/apimachinery => k8s.io/apimachinery v0.0.0-20190913080033-27d36303b655
k8s.io/apiserver => k8s.io/apiserver v0.0.0-20190918160949-bfa5e2e684ad
k8s.io/cli-runtime => k8s.io/cli-runtime v0.0.0-20190918162238-f783a3654da8
k8s.io/client-go => k8s.io/client-go v0.0.0-20190918160344-1fbdaa4c8d90
k8s.io/cloud-provider => k8s.io/cloud-provider v0.0.0-20190918163234-a9c1f33e9fb9
k8s.io/cluster-bootstrap => k8s.io/cluster-bootstrap v0.0.0-20190918163108-da9fdfce26bb
k8s.io/code-generator => k8s.io/code-generator v0.0.0-20190912054826-cd179ad6a269
k8s.io/component-base => k8s.io/component-base v0.0.0-20190918160511-547f6c5d7090
k8s.io/cri-api => k8s.io/cri-api v0.16.5-beta.1
k8s.io/csi-translation-lib => k8s.io/csi-translation-lib v0.0.0-20190918163402-db86a8c7bb21
k8s.io/kube-aggregator => k8s.io/kube-aggregator v0.0.0-20190918161219-8c8f079fddc3
k8s.io/kube-controller-manager => k8s.io/kube-controller-manager v0.0.0-20190918162944-7a93a0ddadd8
k8s.io/kube-proxy => k8s.io/kube-proxy v0.0.0-20190918162534-de037b596c1e
k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.0.0-20190918162820-3b5c1246eb18
k8s.io/kubectl => k8s.io/kubectl v0.0.0-20190918164019-21692a0861df
k8s.io/kubelet => k8s.io/kubelet v0.0.0-20190918162654-250a1838aa2c
k8s.io/legacy-cloud-providers => k8s.io/legacy-cloud-providers v0.0.0-20190918163543-cfa506e53441
k8s.io/metrics => k8s.io/metrics v0.0.0-20190918162108-227c654b2546
k8s.io/sample-apiserver => k8s.io/sample-apiserver v0.0.0-20190918161442-d4c9c65c82af
)

Can not use GPUMounter on k8s

environment:

  • k8s 1.16.15
  • docker 20.10.10

Problem: following QuickStart.md, I installed GPUMounter successfully on my k8s cluster. However, no request to remove or add a GPU has ever succeeded.

I pasted some logs from the gpu-mounter-master container:

remove gpu
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:120 access remove gpu service
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:134 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:135 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:146 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd force: true
2022-02-18T03:44:55.188Z INFO GPUMounter-master/main.go:169 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.201Z ERROR GPUMounter-master/main.go:217 Invalid UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd

add gpu
2022-02-18T03:42:22.897Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-02-18T03:42:22.898Z INFO GPUMounter-master/main.go:30 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default GPU Num: 4 Is entire mount: false
2022-02-18T03:42:22.902Z INFO GPUMounter-master/main.go:66 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = FailedCreated

Some TODOs after merging PR #15

Thanks to @ilyee for adding gang scheduling support in #15, which means gang scheduling can be selected when adding multiple GPUs.

There are still some things that need to be fixed:

  • Add the relevant docs

  • Add the relevant RESTful API

  • Fix the log format error at allocator/allocator.go:159

  • Currently all GPU UUIDs must be provided when unmounting after a gang mount

Problems found while trying out GPU Mounter

1. Using the example from the README (YAML below), the pod can end up scheduled onto a non-GPU host.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: tensorflow/tensorflow:1.13.2-gpu
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo hello; sleep 10;done"]
      env:
       - name: NVIDIA_VISIBLE_DEVICES
         value: "none"
  2. If the image does not contain the mknod executable, mounting fails. Looking at the source code, mknod is indeed required; shouldn't this be stated in the README?
  3. If AddGPU fails, the newly created gpu-pod-slave-* pod keeps occupying the GPU resource (unless gpu-pod is deleted). Should a cleanup callback be added? (See the sketch after this list.)
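A sketch of the cleanup callback suggested in item 3, assuming the allocator holds a client-go clientset and knows the slave pod's name and namespace. The helper is hypothetical and uses the pre-1.18 client-go API pinned in this repo's go.mod.

package allocator

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// cleanupSlavePod is a hypothetical callback: if mounting the GPU into the
// owner pod fails after the slave pod has already been created, delete the
// slave pod so it does not keep the GPU allocated until the owner pod is
// deleted.
func cleanupSlavePod(client kubernetes.Interface, namespace, name string) error {
    return client.CoreV1().Pods(namespace).Delete(name, &metav1.DeleteOptions{})
}

In the AddGPU path this could run as a deferred call that fires only when the mount step returns an error.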

GPUMounter-worker error in k8s v1.23.1

GPUMounter-master.log:
2022-01-16T11:24:14.610Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-01-16T11:24:14.610Z INFO GPUMounter-master/main.go:30 Pod: test Namespace: default GPU Num: 1 Is entire mount: false
2022-01-16T11:24:14.627Z INFO GPUMounter-master/main.go:66 Found Pod: test in Namespace: default on Node: rtxws
2022-01-16T11:24:14.634Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-7dsdf Node: rtxws
2022-01-16T11:24:19.648Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-01-16T11:24:19.648Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = Service Internal Error


GPUMounter-worker.log:
2022-01-16T11:24:14.635Z INFO gpu-mount/server.go:35 AddGPU Service Called
2022-01-16T11:24:14.635Z INFO gpu-mount/server.go:36 request: pod_name:"test" namespace:"default" gpu_num:1
2022-01-16T11:24:14.645Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2022-01-16T11:24:14.645Z INFO allocator/allocator.go:159 Get pod default/test mount type
2022-01-16T11:24:14.645Z INFO collector/collector.go:91 Updating GPU status
2022-01-16T11:24:14.646Z INFO collector/collector.go:136 GPU status update successfully
2022-01-16T11:24:14.657Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: test-slave-pod-2f66ed for Owner Pod: test
2022-01-16T11:24:14.657Z INFO allocator/allocator.go:238 Checking Pods: test-slave-pod-2f66ed state
2022-01-16T11:24:14.661Z INFO allocator/allocator.go:264 Pod: test-slave-pod-2f66ed creating
2022-01-16T11:24:19.442Z INFO allocator/allocator.go:277 Pods: test-slave-pod-2f66ed are running
2022-01-16T11:24:19.442Z INFO allocator/allocator.go:84 Successfully create Slave Pod: %s, for Owner Pod: %s test-slave-pod-2f66edtest
2022-01-16T11:24:19.442Z INFO collector/collector.go:91 Updating GPU status
2022-01-16T11:24:19.444Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: test-slave-pod-2f66ed in Namespace gpu-pool
2022-01-16T11:24:19.444Z INFO collector/collector.go:136 GPU status update successfully
2022-01-16T11:24:19.444Z INFO gpu-mount/server.go:81 Start mounting, Total: 1 Current: 1
2022-01-16T11:24:19.444Z INFO util/util.go:19 Start mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"} to Pod: test
2022-01-16T11:24:19.444Z INFO util/util.go:24 Pod :test container ID: e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3
2022-01-16T11:24:19.444Z INFO util/util.go:30 Successfully get cgroup path: /kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3 for Pod: test
2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:140 Exec "echo 'c 195:0 rw' > /sys/fs/cgroup/devices/kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3/devices.allow" failed
2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:141 Output: sh: 1: cannot create /sys/fs/cgroup/devices/kubepods/burstable/podc815ee4b-bea0-44ed-8ef4-239e69516ba2/e317ca7f5eb5e3c523fab9f0744a065cd69013a7c09522318d4bbf98ad0bb1c3/devices.allow: Directory nonexistent

2022-01-16T11:24:19.445Z ERROR cgroup/cgroup.go:142 exit status 2
2022-01-16T11:24:19.445Z ERROR util/util.go:33 Add GPU {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"}failed
2022-01-16T11:24:19.445Z ERROR gpu-mount/server.go:84 Mount GPU: {"MinorNumber":0,"DeviceFilePath":"/dev/nvidia0","UUID":"GPU-7fe47fc1-b21e-e675-f6ff-edd91910f8a7","State":"GPU_ALLOCATED_STATE","PodName":"test-slave-pod-2f66ed","Namespace":"gpu-pool"} to Pod: test in Namespace: default failed
2022-01-16T11:24:19.445Z ERROR gpu-mount/server.go:85 exit status 2


Environment and versions

  • k8s version: v1.23
  • docker-client version: 19.03.13
  • docker-server version: 20.10.12

In k8s v1.23, "/sys/fs/cgroup/devices/kubepods/burstable/pod[pod-id]/[container-id]/devices.allow" has changed to "/sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod[pod-id]/docker-[container-id].scope/devices.allow".

So the current GPU Mounter cannot work properly on v1.23.

Could it be updated to support k8s v1.23? Thanks.
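For reference, a sketch of the two devices-cgroup layouts involved. The helper and its parameters are the editor's illustration, not GPU Mounter's code; note that systemd slice names replace the dashes inside the pod UID with underscores.

package cgroup

import (
    "fmt"
    "path/filepath"
    "strings"
)

// devicesCgroupPath is an illustrative helper (its name and parameters are
// assumptions, not GPU Mounter's API). qos is e.g. "burstable" or
// "besteffort"; guaranteed pods, which sit directly under kubepods, are left
// out for brevity.
func devicesCgroupPath(cgroupDriver, qos, podUID, containerID string) string {
    const root = "/sys/fs/cgroup/devices"
    switch cgroupDriver {
    case "systemd":
        // Layout reported above for k8s v1.23 + docker with the systemd driver:
        //   kubepods.slice/kubepods-burstable.slice/
        //   kubepods-burstable-pod<uid>.slice/docker-<container-id>.scope
        // Systemd slice names use underscores instead of dashes in the pod UID.
        uid := strings.ReplaceAll(podUID, "-", "_")
        return filepath.Join(root,
            "kubepods.slice",
            fmt.Sprintf("kubepods-%s.slice", qos),
            fmt.Sprintf("kubepods-%s-pod%s.slice", qos, uid),
            fmt.Sprintf("docker-%s.scope", containerID),
        )
    default:
        // cgroupfs layout, which the path in the error log above follows:
        //   kubepods/burstable/pod<uid>/<container-id>
        return filepath.Join(root, "kubepods", qos, "pod"+podUID, containerID)
    }
}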

Is it necessary to bind only one GPU to one slave pod?

Hello, I am elihe from Zhihu. I have read your article on Zhihu before. After reading your code I have a question:
Why is each slave pod bound to only one GPU in the GetAvailableGPU method of pkg/util/gpu/allocator/allocator.go?
In my opinion, in a large-scale cluster this will put additional load on the master node (there will be a much larger number of pod creation requests). Also, creating multiple single-card slave pods may cause two competing GPU mount requests to both fail (for example, there are 4 available GPUs and two requests each want to mount 4 cards; one request successfully creates slave pods 1 and 2, the other creates slave pods 3 and 4, and neither can obtain any more resources).
If you agree with me, can I submit a merge request to optimize this?
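A rough sketch of the proposed direction, using the nvidia.com/gpu resource name the slave pods already request (the helper itself is hypothetical): request all of the GPUs in one slave pod so the allocation either succeeds or fails atomically, instead of racing several single-GPU slave pods.

package allocator

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// multiGPUSlaveResources is a hypothetical helper: instead of creating gpuNum
// slave pods that each request one nvidia.com/gpu, request all of them from a
// single slave pod, so the scheduler grants the whole set or nothing and two
// concurrent mount requests cannot each grab half of the cards.
func multiGPUSlaveResources(gpuNum int64) corev1.ResourceRequirements {
    gpus := *resource.NewQuantity(gpuNum, resource.DecimalSI)
    return corev1.ResourceRequirements{
        Limits: corev1.ResourceList{
            corev1.ResourceName("nvidia.com/gpu"): gpus,
        },
    }
}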

On a GPU node, gpu-mounter-worker reports: nvml error: %+vcould not load NVML library

  • On the GPU node, the gpu-mounter-worker error log is as follows:
[root@t32 ~]# kubectl  logs -f gpu-mounter-workers-ccqfv  -n kube-system
2021-02-10T01:01:09.689Z        INFO    GPUMounter-worker/main.go:15    Service Starting...
2021-02-10T01:01:09.690Z        INFO    gpu-mount/server.go:21  Creating gpu mounter
2021-02-10T01:01:09.690Z        INFO    allocator/allocator.go:27       Creating gpu allocator
2021-02-10T01:01:09.690Z        INFO    collector/collector.go:23       Creating gpu collector
2021-02-10T01:01:09.690Z        INFO    collector/collector.go:41       Start get gpu info
2021-02-10T01:01:09.690Z        ERROR   collector/collector.go:43       nvml error: %+vcould not load NVML library
2021-02-10T01:01:09.690Z        ERROR   collector/collector.go:26       Failed to init gpu collector
2021-02-10T01:01:09.690Z        ERROR   allocator/allocator.go:31       Failed to init gpu collector
2021-02-10T01:01:09.690Z        ERROR   gpu-mount/server.go:25  Filed to init gpu allocator
2021-02-10T01:01:09.690Z        ERROR   GPUMounter-worker/main.go:18    Failed to init gpu mounter
2021-02-10T01:01:09.690Z        ERROR   GPUMounter-worker/main.go:19    could not load NVML library
  • The gpu-mounter-worker pod is scheduled as follows:
[root@t32 ~]# kubectl  get pod  gpu-mounter-workers-ccqfv  -n kube-system -o wide 
NAME                        READY   STATUS             RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
gpu-mounter-workers-ccqfv   0/1     ImagePullBackOff   1814       20d   10.42.5.33   t90    <none>           <none>
  • The GPU node information is as follows:
(base) root@t90:~# nvidia-smi 
Tue Feb 23 14:28:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   34C    P0    32W / 250W |   9114MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   34C    P0    31W / 250W |    520MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1024      C   python                                       869MiB |
|    0      1225      C   /usr/local/bin/python                       8235MiB |
|    1      1024      C   python                                       255MiB |
|    1      1225      C   /usr/local/bin/python                        255MiB |
+-----------------------------------------------------------------------------+

The slave pod gets killed shortly after a successful mount, so the GPU cannot be unmounted

A question: after the mount flow creates the slave pod, shouldn't that slave pod keep running until removeGPU is called?
In my environment the slave pod is killed not long after it starts running, and removeGPU then fails. What could be causing it to be killed? Any ideas? Was it evicted? Where should I start troubleshooting? Thanks!

sol-UniServer-R4900-G3:~/go/src/github.com/jason-gideon/GPUMounter/example$ kubectl -n gpu-pool describe pod gpu-pod-slave-pod-6ffc13
Name:                      gpu-pod-slave-pod-6ffc13
Namespace:                 gpu-pool
Priority:                  0
Service Account:           default
Node:                      software-dell-r740-015/10.115.0.253
Start Time:                Tue, 06 Dec 2022 18:46:36 +0800
Labels:                    app=gpu-pool
Annotations:               cni.projectcalico.org/containerID: f3cbb407ae1601047a04a8e322b4eca80abd70df24f9de9e5f105586dd1d98fd
                           cni.projectcalico.org/podIP: 10.42.1.143/32
                           cni.projectcalico.org/podIPs: 10.42.1.143/32
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        10.42.1.143
IPs:
  IP:           10.42.1.143
Controlled By:  Pod/gpu-pod
Containers:
  gpu-container:
    Container ID:  docker://e7f1f51dd6c3996d93172e1f56b3f955042d6f15726b6fb71745eb2bb6499707
    Image:         alpine:latest
    Image ID:      docker-pullable://alpine@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo this is a gpu pool container; sleep 10;done
    State:          Running
      Started:      Tue, 06 Dec 2022 18:46:41 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jhkdp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-jhkdp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=software-dell-r740-015
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From                          Message
  ----     ------                    ----  ----                          -------
  Normal   Scheduled                 30s   default-scheduler             Successfully assigned gpu-pool/gpu-pod-slave-pod-6ffc13 to software-dell-r740-015
  Warning  OwnerRefInvalidNamespace  30s   garbage-collector-controller  ownerRef [v1/Pod, namespace: gpu-pool, name: gpu-pod, uid: 6c482ef7-9acd-41ab-925e-101e166f75de] does not exist in namespace "gpu-pool"
  Normal   AddedInterface            28s   multus                        Add eth0 [10.42.1.143/32]
  Normal   Pulling                   28s   kubelet                       Pulling image "alpine:latest"
  Normal   Pulled                    26s   kubelet                       Successfully pulled image "alpine:latest" in 2.151216821s
  Normal   Created                   26s   kubelet                       Created container gpu-container
  Normal   Started                   25s   kubelet                       Started container gpu-container
  Normal   Killing                   10s   kubelet                       Stopping container gpu-container

Insufficient GPU on Node: xxx

The first mount succeeded, but after unmounting and deploying again it reports "Insufficient GPU on Node: yigou-dev-102-46", even though the GPU is actually idle.

Wait until the slave pod deletion is finished

In the current version, the remove GPU service returns without waiting for the deletion of the slave pods.

This is unreasonable, because the kubelet considers the GPU resource to still be occupied by the slave pod until the slave pod deletion has finished.
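A minimal sketch of the missing waiting step, written against the pre-1.18 client-go API pinned in this repo's go.mod; the helper name, poll interval, and timeout are assumptions.

package allocator

import (
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForSlavePodDeletion is a hypothetical helper: after the remove GPU
// service deletes a slave pod, poll until the pod object is actually gone, so
// the kubelet has released the GPU before the service returns.
func waitForSlavePodDeletion(client kubernetes.Interface, namespace, name string) error {
    return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
        _, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil // deletion finished, the GPU is free again
        }
        return false, err // still terminating, or abort on an unexpected API error
    })
}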
