Comments (5)

pokerfaceSad commented on September 24, 2024

Run kubectl describe node on the node and check whether its GPU resources are actually free.
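As a rough sketch of that check, free GPUs on a node can be estimated as the node's allocatable GPU count minus the GPU limits of the active pods scheduled on it. The resource name nvidia.com/gpu, the helper name free_gpus, and the input shape (dicts as returned by kubectl get node NAME -o json and the items of kubectl get pods -A -o json) are assumptions for illustration, not part of GPUMounter. Init containers are ignored for brevity.

```python
from typing import Any, Dict, List

# Assumed resource name; use whatever the device plugin advertises on the node.
GPU_RESOURCE = "nvidia.com/gpu"


def free_gpus(node: Dict[str, Any], pods: List[Dict[str, Any]],
              resource: str = GPU_RESOURCE) -> int:
    """Return allocatable GPUs on the node minus GPUs held by its active pods."""
    allocatable = int(node["status"].get("allocatable", {}).get(resource, "0"))
    node_name = node["metadata"]["name"]
    used = 0
    for pod in pods:
        # Only pods scheduled on this node count against it.
        if pod["spec"].get("nodeName") != node_name:
            continue
        # Finished pods no longer hold their devices.
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue
        for container in pod["spec"].get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            used += int(limits.get(resource, "0"))
    return allocatable - used


if __name__ == "__main__":
    # Illustrative data only: a node advertising 2 GPUs and one pod without GPU limits.
    node = {"metadata": {"name": "yigou-dev-102-46"},
            "status": {"allocatable": {GPU_RESOURCE: "2"}}}
    pods = [{"metadata": {"name": "gpu-pod-1", "namespace": "default"},
             "spec": {"nodeName": "yigou-dev-102-46",
                      "containers": [{"name": "main", "resources": {}}]},
             "status": {"phase": "Running"}}]
    print(free_gpus(node, pods))
```

A result greater than zero means the scheduler should be able to place a pod requesting one GPU on that node, so an "Insufficient GPU" error in that situation points to a cause other than capacity.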

from gpumounter.

lyon-v commented on September 24, 2024

1. Master logs:
[root@yigou-dev-102-45 examples]# kubectl logs gpu-mounter-master-bc547448d-t5nkl -n kube-system
2023-11-07T14:32:53.960Z INFO GPUMounter-master/main.go:239 Start gpu mounter master on :8080
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:25 access add gpu service
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:30 Pod: gpu-pod-1 Namespace: default GPU Num: 1 Is entire mount: false
2023-11-08T01:04:59.812Z INFO GPUMounter-master/main.go:66 Found Pod: gpu-pod-1 in Namespace: default on Node: yigou-dev-102-46
2023-11-08T01:04:59.822Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-s8j2f Node: yigou-dev-102-46
2023-11-08T01:04:59.925Z ERROR GPUMounter-master/main.go:109 Insufficient GPU on Node: yigou-dev-102-46

2. Worker logs:
[root@yigou-dev-102-45 elastic-jupyter]# kubectl logs gpu-mounter-workers-s8j2f -n kube-system
2023-11-07T14:30:26.648Z INFO GPUMounter-worker/main.go:15 Service Starting...
2023-11-07T14:30:26.648Z INFO gpu-mount/server.go:22 Creating gpu mounter
2023-11-07T14:30:26.648Z INFO allocator/allocator.go:28 Creating gpu allocator
2023-11-07T14:30:26.648Z INFO collector/collector.go:24 Creating gpu collector
2023-11-07T14:30:26.648Z INFO collector/collector.go:42 Start get gpu info
2023-11-07T14:30:26.652Z INFO collector/collector.go:53 GPU Num: 2
2023-11-07T14:30:26.664Z INFO collector/collector.go:91 Updating GPU status
2023-11-07T14:30:26.667Z INFO collector/collector.go:136 GPU status update successfully
2023-11-07T14:30:26.667Z INFO collector/collector.go:36 Successfully update gpu status
2023-11-07T14:30:26.667Z INFO allocator/allocator.go:35 Successfully created gpu collector
2023-11-07T14:30:26.667Z INFO gpu-mount/server.go:29 Successfully created gpu allocator
2023-11-07T14:30:26.667Z INFO GPUMounter-worker/main.go:22 Successfully created gpu mounter
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:35 AddGPU Service Called
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:36 request: pod_name:"gpu-pod-1" namespace:"default" gpu_num:1
2023-11-08T01:04:59.848Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2023-11-08T01:04:59.848Z INFO allocator/allocator.go:159 Get pod default/gpu-pod-1 mount type
2023-11-08T01:04:59.848Z INFO collector/collector.go:91 Updating GPU status
2023-11-08T01:04:59.851Z INFO collector/collector.go:136 GPU status update successfully
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: gpu-pod-1-slave-pod-1d3148 for Owner Pod: gpu-pod-1
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:238 Checking Pods: gpu-pod-1-slave-pod-1d3148 state
2023-11-08T01:04:59.882Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.886Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.917Z INFO allocator/allocator.go:268 No enough gpu for Pod: gpu-pod-1-slave-pod-1d3148
2023-11-08T01:04:59.925Z ERROR gpu-mount/server.go:70 Insufficient gpu for Pod: gpu-pod-1 Namespace: default

3. Node resources:
[root@yigou-dev-102-45 yamls]# kubectl describe node yigou-dev-102-46
Name: yigou-dev-102-46
Roles: cpu,training
Annotations: {"":"yigou-dev-102-46","":"yigou-dev-102-46"} unix:///var/run/cri-dockerd.sock 0 true
CreationTimestamp: Mon, 28 Aug 2023 17:11:40 +0800
Unschedulable: false
HolderIdentity: yigou-dev-102-46
RenewTime: Wed, 08 Nov 2023 10:58:29 +0800
Type                Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
----                ------  -----------------                ------------------               ------                      -------
NetworkUnavailable  False   Tue, 07 Nov 2023 18:13:30 +0800  Tue, 07 Nov 2023 18:13:30 +0800  CalicoIsUp                  Calico is running on this node
MemoryPressure      False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory available
DiskPressure        False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasNoDiskPressure    kubelet has no disk pressure
PIDPressure         False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasSufficientPID     kubelet has sufficient PID available
Ready               True    Wed, 08 Nov 2023 10:58:23 +0800  Tue, 07 Nov 2023 18:13:01 +0800  KubeletReady                kubelet is posting ready status
Hostname: yigou-dev-102-46
Capacity:
  cpu: 32
  ephemeral-storage: 1018975Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65801852Ki 0 2
  pods: 110
Allocatable:
  cpu: 32
  ephemeral-storage: 961625455048
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65699452Ki 0 2
  pods: 110
System Info:
Machine ID: 2a5da2a3fe9d480f97ca66b4d8f4287b
System UUID: 32B53042-47EB-9349-E452-0B470FA25211
Boot ID: 87d25649-faa9-45da-a0ce-5b4b102fd56e
Kernel Version: 3.10.0-1160.99.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://24.0.6
Kubelet Version: v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
Kube-Proxy Version: v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
Non-terminated Pods: (30 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age

calico-system calico-node-gfzjw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d
calico-system calico-typha-856c6c9c4c-bzzbn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
calico-system csi-node-driver-vlxt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
default gpu-pod-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42s
elastic-jupyter-operator-system elastic-jupyter-operator-controller-manager-5d559bbbb8-77frj 100m (0%) 100m (0%) 20Mi (0%) 30Mi (0%) 23h
fluid-system alluxioruntime-controller-5b4fd8d788-56c4k 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system csi-nodeplugin-fluid-4dw44 0 (0%) 0 (0%) 0 (0%) 0 (0%) 15d
fluid-system dataset-controller-665ff849b7-cnrh9 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system dataset-controller-665ff849b7-xrvm7 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 29d
fluid-system fluid-webhook-8689694b95-jsn49 0 (0%) 0 (0%) 0 (0%) 0 (0%) 49d
fluid-system fluidapp-controller-698b685d4f-z84r6 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system thinruntime-controller-674bb4784b-qvcks 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid fluiddataset-fuse-264x4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16d
fluid fluiddataset-master-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
fluid fluiddataset-worker-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
heros-controllers-system hero-controllers-controller-manager-cbdb77cf6-l7bjl 15m (0%) 1 (3%) 128Mi (0%) 256Mi (0%) 38h
heros-system file-proxy-5b5f76cf8d-xnj72 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
kube-system gpu-mounter-master-bc547448d-x77th 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42m
kube-system gpu-mounter-workers-7lf7k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42m
kube-system kube-proxy-rh2vt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d
kube-system kube-sealos-lvscare-yigou-dev-102-46 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
kube-system tigera-operator-66fd59dc66-tn24j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
monitoring alertmanager-main-0 104m (0%) 200m (0%) 150Mi (0%) 150Mi (0%) 5d
monitoring cadvisor-8cwmf 400m (1%) 800m (2%) 400Mi (0%) 2000Mi (3%) 12d
monitoring dcgm-exporter-7rxkt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
monitoring loki-stack-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
monitoring loki-stack-promtail-x9zwg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16d
monitoring node-exporter-ppdvb 112m (0%) 270m (0%) 200Mi (0%) 220Mi (0%) 63d
monitoring prometheus-k8s-0 100m (0%) 100m (0%) 450Mi (0%) 50Mi (0%) 5d
nvidia-device-plugin nvdp-nvidia-device-plugin-r4zjr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits

cpu 1331m (4%) 2970m (9%)
memory 2348Mi (3%) 10386Mi (16%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%) 0 0 0 0
Type Reason Age From Message

Normal RegisteredNode 32m node-controller Node yigou-dev-102-46 event: Registered Node yigou-dev-102-46 in Controller

lyon-v commented on September 24, 2024

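I may have misspoken: the GPU was not mounted the first time either, even though the GPUs are idle. The information above is the master and worker logs plus the details of node 46. There is no slave pod under gpu-pool.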
[root@yigou-dev-102-45 ~]# kubelet --version
Kubernetes v1.25.13

pokerfaceSad commented on September 24, 2024

There is a known issue on k8s 1.20+: an ownerReference is not allowed to cross namespaces, so creating the slave pod fails.

See #19 (comment).
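The restriction can be illustrated with a small sketch. An ownerReference carries apiVersion, kind, name and uid but no namespace field, so Kubernetes looks the owner up in the dependent's own namespace. The helper owner_resolvable and the uid value below are hypothetical. The pod names and the default and gpu-pool namespaces are taken from the logs and comments in this thread.

```python
from typing import Any, Dict


def owner_resolvable(dependent_meta: Dict[str, Any], owner_meta: Dict[str, Any]) -> bool:
    """Return True if the dependent's ownerReferences can resolve to the given owner.

    A namespaced owner must live in the same namespace as its dependent.
    A cluster-scoped owner (no namespace) may own objects in any namespace.
    """
    owner_ns = owner_meta.get("namespace")
    if owner_ns is not None and owner_ns != dependent_meta.get("namespace"):
        return False
    refs = dependent_meta.get("ownerReferences", [])
    return any(ref.get("uid") == owner_meta["uid"] for ref in refs)


if __name__ == "__main__":
    # Hypothetical uid; real uids are assigned by the API server.
    owner = {"name": "gpu-pod-1", "namespace": "default", "uid": "hypothetical-uid"}
    owner_ref = {"apiVersion": "v1", "kind": "Pod",
                 "name": owner["name"], "uid": owner["uid"]}

    slave_in_pool = {"name": "gpu-pod-1-slave-pod-1d3148", "namespace": "gpu-pool",
                     "ownerReferences": [owner_ref]}
    slave_in_default = dict(slave_in_pool, namespace="default")

    print(owner_resolvable(slave_in_pool, owner))     # False: owner is in another namespace
    print(owner_resolvable(slave_in_default, owner))  # True: same namespace
```

Under this reading, a slave pod in gpu-pool that names an owner pod in default has an unresolvable owner, which would be consistent with the slave pod disappearing shortly after creation and the worker reporting insufficient GPUs.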

lyon-v commented on September 24, 2024

[root@yigou-dev-102-45 ~]# cat /etc/docker/daemon.json
{
  "data-root": "/var/lib/docker",
  "exec-opts": [...],
  "insecure-registries": [...],
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-file": "3",
    "max-size": "10m"
  },
  "max-concurrent-downloads": 20,
  "registry-mirrors": [...],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
