Comments (4)
This happens because volcano-device-plugin runs as a DaemonSet, so its pod may fail on nodes that lack GPU resources.
A temporary workaround is to use taints or node affinity to keep pods off these abnormal nodes.
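As a rough illustration of the affinity workaround, the pod template fragment below restricts scheduling to nodes that carry a GPU label. It is only a sketch: the label key and value (nvidia-device-enable: enable) are an assumption about how GPU nodes are labeled in the cluster, so substitute whatever label your GPU nodes actually have.

```yaml
# Pod template fragment (sketch). The label key/value is an assumption;
# use the label that actually marks GPU nodes in your cluster.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia-device-enable
            operator: In
            values:
            - enable
```

A plain nodeSelector with the same label achieves the same effect for simple cases; node affinity is mainly useful when several labels or operators need to be combined.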
from devices.
Could you provide the following information so that we can locate the problem? We would be very grateful.
- The panic stack trace.
- The YAML file used to install the volcano device plugin.
- The YAML file of the workload. (If it contains business information, it can be redacted.)
from devices.
$ k get ds volcano-device-plugin -o yaml | neat
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "6"
  name: volcano-device-plugin
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: volcano-device-plugin
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: volcano-device-plugin
    spec:
      containers:
      - args:
        - --gpu-strategy=number
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: volcanosh/volcano-device-plugin:latest
        imagePullPolicy: IfNotPresent
        name: volcano-device-plugin
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - SYS_ADMIN
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /usr/local/vgpu
          name: lib
        - mountPath: /tmp
          name: hosttmp
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia-device-enable: enable
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: volcano-device-plugin
      serviceAccountName: volcano-device-plugin
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: volcano.sh/gpu-memory
        operator: Exists
      - key: volcano.sh/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - hostPath:
          path: /usr/local/vgpu
          type: ""
        name: lib
      - hostPath:
          path: /tmp
          type: ""
        name: hosttmp
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
from devices.
The workload YAML is a vcjob, as edited.
When I change #volcano.sh/gpu-number: 1 to nvidia.com/gpu: 1, the error goes away, but the pod is not scheduled and stays Pending,
even though there are enough resources.
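For readers following along, the resource key under discussion sits in the container limits of the vcjob's pod template. The manifest below is a hypothetical sketch, not the reporter's actual workload (which is not shown in this thread): the job name, task name, image, and command are placeholders, and it assumes the job is scheduled by the volcano scheduler.

```yaml
# Hypothetical vcjob sketch; names, image, and command are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-job-example
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - name: worker
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: example.com/cuda-app:latest  # placeholder image
          command: ["nvidia-smi"]
          resources:
            limits:
              # The key the reporter swapped for nvidia.com/gpu: 1
              volcano.sh/gpu-number: 1
```

A pod can only be scheduled if some node advertises the requested extended resource, so when switching keys it is worth checking which GPU resource names appear under the node's allocatable resources.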
from devices.