Comments (4)

AshinWu commented on September 12, 2024

This is because volcano-device-plugin runs as a daemon service, and on some nodes its pod may fail because the node lacks GPU resources.
A temporary workaround is to use taints or affinity to keep pods off these abnormal nodes.
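
As a rough illustration of the affinity option, the fragment below is a minimal sketch of a pod-spec `affinity` stanza that avoids nodes carrying a marker label. The label key `example.com/gpu-abnormal` and its value are hypothetical and would have to be applied to the affected nodes by the cluster operator; the thread does not specify which label or taint to use.

```yaml
# Sketch only: goes under the pod template's `spec`.
# `example.com/gpu-abnormal` is a hypothetical label, not one defined by volcano.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/gpu-abnormal
          operator: NotIn        # skip nodes explicitly marked as abnormal
          values:
          - "true"
```

Nodes without the label still match, so only nodes explicitly marked as abnormal are excluded.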

from devices.

wangyang0616 commented on September 12, 2024

Could you provide the following information so that we can locate the problem? We would be very grateful.

  1. The panic stack trace.
  2. The YAML file used to install the volcano device plugin.
  3. The YAML file of the workload (if it contains business information, it can be redacted).

oldthreefeng commented on September 12, 2024

$ k get ds volcano-device-plugin -o yaml | neat
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "6"
  name: volcano-device-plugin
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: volcano-device-plugin
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: volcano-device-plugin
    spec:
      containers:
      - args:
        - --gpu-strategy=number
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: volcanosh/volcano-device-plugin:latest
        imagePullPolicy: IfNotPresent
        name: volcano-device-plugin
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - SYS_ADMIN
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
        - mountPath: /usr/local/vgpu
          name: lib
        - mountPath: /tmp
          name: hosttmp
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia-device-enable: enable
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: volcano-device-plugin
      serviceAccountName: volcano-device-plugin
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: volcano.sh/gpu-memory
        operator: Exists
      - key: volcano.sh/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - hostPath:
          path: /usr/local/vgpu
          type: ""
        name: lib
      - hostPath:
          path: /tmp
          type: ""
        name: hosttmp
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

@wangyang0616

oldthreefeng commented on September 12, 2024

The workload YAML is a vcjob; see the edited post.

When I change #volcano.sh/gpu-number: 1 to nvidia.com/gpu: 1, the error goes away, but the pod is not scheduled and stays Pending, even though resources are sufficient.
