
Comments (5)

pokerfaceSad commented on September 24, 2024

Run kubectl describe node on the node and check whether its GPU resources are idle.

from gpumounter.

lyon-v commented on September 24, 2024

1. Master log:
[root@yigou-dev-102-45 examples]# kubectl logs gpu-mounter-master-bc547448d-t5nkl -n kube-system
2023-11-07T14:32:53.960Z INFO GPUMounter-master/main.go:239 Start gpu mounter master on :8080
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:25 access add gpu service
2023-11-08T01:04:59.791Z INFO GPUMounter-master/main.go:30 Pod: gpu-pod-1 Namespace: default GPU Num: 1 Is entire mount: false
2023-11-08T01:04:59.812Z INFO GPUMounter-master/main.go:66 Found Pod: gpu-pod-1 in Namespace: default on Node: yigou-dev-102-46
2023-11-08T01:04:59.822Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-s8j2f Node: yigou-dev-102-46
2023-11-08T01:04:59.925Z ERROR GPUMounter-master/main.go:109 Insufficient GPU on Node: yigou-dev-102-46

############################################################################################
2. Worker log:
[root@yigou-dev-102-45 elastic-jupyter]# kubectl logs gpu-mounter-workers-s8j2f -n kube-system
2023-11-07T14:30:26.648Z INFO GPUMounter-worker/main.go:15 Service Starting...
2023-11-07T14:30:26.648Z INFO gpu-mount/server.go:22 Creating gpu mounter
2023-11-07T14:30:26.648Z INFO allocator/allocator.go:28 Creating gpu allocator
2023-11-07T14:30:26.648Z INFO collector/collector.go:24 Creating gpu collector
2023-11-07T14:30:26.648Z INFO collector/collector.go:42 Start get gpu info
2023-11-07T14:30:26.652Z INFO collector/collector.go:53 GPU Num: 2
2023-11-07T14:30:26.664Z INFO collector/collector.go:91 Updating GPU status
2023-11-07T14:30:26.667Z INFO collector/collector.go:136 GPU status update successfully
2023-11-07T14:30:26.667Z INFO collector/collector.go:36 Successfully update gpu status
2023-11-07T14:30:26.667Z INFO allocator/allocator.go:35 Successfully created gpu collector
2023-11-07T14:30:26.667Z INFO gpu-mount/server.go:29 Successfully created gpu allocator
2023-11-07T14:30:26.667Z INFO GPUMounter-worker/main.go:22 Successfully created gpu mounter
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:35 AddGPU Service Called
2023-11-08T01:04:59.825Z INFO gpu-mount/server.go:36 request: pod_name:"gpu-pod-1" namespace:"default" gpu_num:1
2023-11-08T01:04:59.848Z INFO gpu-mount/server.go:55 Successfully get Pod: default in cluster
2023-11-08T01:04:59.848Z INFO allocator/allocator.go:159 Get pod default/gpu-pod-1 mount type
2023-11-08T01:04:59.848Z INFO collector/collector.go:91 Updating GPU status
2023-11-08T01:04:59.851Z INFO collector/collector.go:136 GPU status update successfully
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: gpu-pod-1-slave-pod-1d3148 for Owner Pod: gpu-pod-1
2023-11-08T01:04:59.880Z INFO allocator/allocator.go:238 Checking Pods: gpu-pod-1-slave-pod-1d3148 state
2023-11-08T01:04:59.882Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.886Z INFO allocator/allocator.go:264 Pod: gpu-pod-1-slave-pod-1d3148 creating
2023-11-08T01:04:59.917Z INFO allocator/allocator.go:268 No enough gpu for Pod: gpu-pod-1-slave-pod-1d3148
2023-11-08T01:04:59.925Z ERROR gpu-mount/server.go:70 Insufficient gpu for Pod: gpu-pod-1 Namespace: default

#######################################################################################
3. Node resources:
[root@yigou-dev-102-45 yamls]# kubectl describe node yigou-dev-102-46
Name: yigou-dev-102-46
Roles: cpu,training
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
fluid.io/dataset-num=1
fluid.io/f-fluid-fluiddataset=true
fluid.io/s-alluxio-fluid-fluiddataset=true
fluid.io/s-fluid-fluiddataset=true
fluid.io/s-h-alluxio-d-fluid-fluiddataset=5GiB
fluid.io/s-h-alluxio-t-fluid-fluiddataset=5GiB
gpu-mounter-enable=enable
kubernetes.io/arch=amd64
kubernetes.io/hostname=yigou-dev-102-46
kubernetes.io/os=linux
node-role.kubernetes.io/cpu=true
node-role.kubernetes.io/training=true
Annotations: csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"yigou-dev-102-46","fuse.csi.fluid.io":"yigou-dev-102-46"}
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/cri-dockerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.0.102.46/24
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 28 Aug 2023 17:11:40 +0800
Taints:
Unschedulable: false
Lease:
HolderIdentity: yigou-dev-102-46
AcquireTime:
RenewTime: Wed, 08 Nov 2023 10:58:29 +0800
Conditions:
Type                Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
----                ------  -----------------                ------------------               ------                      -------
NetworkUnavailable  False   Tue, 07 Nov 2023 18:13:30 +0800  Tue, 07 Nov 2023 18:13:30 +0800  CalicoIsUp                  Calico is running on this node
MemoryPressure      False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory available
DiskPressure        False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasNoDiskPressure    kubelet has no disk pressure
PIDPressure         False   Wed, 08 Nov 2023 10:58:23 +0800  Fri, 03 Nov 2023 10:39:46 +0800  KubeletHasSufficientPID     kubelet has sufficient PID available
Ready               True    Wed, 08 Nov 2023 10:58:23 +0800  Tue, 07 Nov 2023 18:13:01 +0800  KubeletReady                kubelet is posting ready status
Addresses:
InternalIP: 10.0.102.46
Hostname: yigou-dev-102-46
Capacity:
cpu: 32
ephemeral-storage: 1018975Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65801852Ki
nvidia.com/gpu: 0
nvidia.com/nvidia-rtx-3090: 2
pods: 110
Allocatable:
cpu: 32
ephemeral-storage: 961625455048
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65699452Ki
nvidia.com/gpu: 0
nvidia.com/nvidia-rtx-3090: 2
pods: 110
System Info:
Machine ID: 2a5da2a3fe9d480f97ca66b4d8f4287b
System UUID: 32B53042-47EB-9349-E452-0B470FA25211
Boot ID: 87d25649-faa9-45da-a0ce-5b4b102fd56e
Kernel Version: 3.10.0-1160.99.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://24.0.6
Kubelet Version: v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
Kube-Proxy Version: v0.0.0-master+5244794d27b4cc68290bc496b00e248857ac8b47
PodCIDR: 100.64.1.0/24
PodCIDRs: 100.64.1.0/24
Non-terminated Pods: (30 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age


calico-system calico-node-gfzjw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d
calico-system calico-typha-856c6c9c4c-bzzbn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
calico-system csi-node-driver-vlxt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
default gpu-pod-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42s
elastic-jupyter-operator-system elastic-jupyter-operator-controller-manager-5d559bbbb8-77frj 100m (0%) 100m (0%) 20Mi (0%) 30Mi (0%) 23h
fluid-system alluxioruntime-controller-5b4fd8d788-56c4k 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system csi-nodeplugin-fluid-4dw44 0 (0%) 0 (0%) 0 (0%) 0 (0%) 15d
fluid-system dataset-controller-665ff849b7-cnrh9 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system dataset-controller-665ff849b7-xrvm7 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 29d
fluid-system fluid-webhook-8689694b95-jsn49 0 (0%) 0 (0%) 0 (0%) 0 (0%) 49d
fluid-system fluidapp-controller-698b685d4f-z84r6 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid-system thinruntime-controller-674bb4784b-qvcks 100m (0%) 100m (0%) 200Mi (0%) 1536Mi (2%) 49d
fluid fluiddataset-fuse-264x4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16d
fluid fluiddataset-master-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
fluid fluiddataset-worker-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
heros-controllers-system hero-controllers-controller-manager-cbdb77cf6-l7bjl 15m (0%) 1 (3%) 128Mi (0%) 256Mi (0%) 38h
heros-system file-proxy-5b5f76cf8d-xnj72 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
kube-system gpu-mounter-master-bc547448d-x77th 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42m
kube-system gpu-mounter-workers-7lf7k 0 (0%) 0 (0%) 0 (0%) 0 (0%) 42m
kube-system kube-proxy-rh2vt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d
kube-system kube-sealos-lvscare-yigou-dev-102-46 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
kube-system tigera-operator-66fd59dc66-tn24j 0 (0%) 0 (0%) 0 (0%) 0 (0%) 71d
monitoring alertmanager-main-0 104m (0%) 200m (0%) 150Mi (0%) 150Mi (0%) 5d
monitoring cadvisor-8cwmf 400m (1%) 800m (2%) 400Mi (0%) 2000Mi (3%) 12d
monitoring dcgm-exporter-7rxkt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
monitoring loki-stack-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d
monitoring loki-stack-promtail-x9zwg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16d
monitoring node-exporter-ppdvb 112m (0%) 270m (0%) 200Mi (0%) 220Mi (0%) 63d
monitoring prometheus-k8s-0 100m (0%) 100m (0%) 450Mi (0%) 50Mi (0%) 5d
nvidia-device-plugin nvdp-nvidia-device-plugin-r4zjr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 18h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource                    Requests     Limits
--------                    --------     ------
cpu                         1331m (4%)   2970m (9%)
memory                      2348Mi (3%)  10386Mi (16%)
ephemeral-storage           0 (0%)       0 (0%)
hugepages-1Gi               0 (0%)       0 (0%)
hugepages-2Mi               0 (0%)       0 (0%)
nvidia.com/gpu              0            0
nvidia.com/nvidia-rtx-3090  0            0
Events:
Type    Reason          Age  From             Message
----    ------          ---  ----             -------
Normal  RegisteredNode  32m  node-controller  Node yigou-dev-102-46 event: Registered Node yigou-dev-102-46 in Controller
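
For anyone repeating the check on a node listing like the one above, here is a rough Python sketch that reads the text of kubectl describe node and reports, per nvidia.com/* resource name, how many devices are allocatable and how many are still unrequested. The function name gpu_availability and the line-based parsing are my own assumptions about the text layout, not part of GPUMounter or kubectl. It is only meant to show which resource name the free GPUs are actually advertised under.

```python
import re


def gpu_availability(describe_text):
    """Return {resource: (allocatable, free)} for every nvidia.com/* resource.

    Hypothetical helper: it parses the plain-text output of
    `kubectl describe node`, assuming the "Allocatable:" and
    "Allocated resources:" section headers appear on their own lines.
    """
    allocatable, requested = {}, {}
    section = None
    for raw in describe_text.splitlines():
        line = raw.strip()
        if line in ("Allocatable:", "Allocated resources:"):
            section = line
            continue
        # Any other bare header ("Capacity:", "System Info:", "Events:") ends the section.
        if re.fullmatch(r"[A-Za-z][A-Za-z -]*:", line):
            section = None
            continue
        if section is None or not line.startswith("nvidia.com/"):
            continue
        # "name: 2" in Allocatable, "name 0 0" (requests, limits) in Allocated resources.
        fields = line.replace(":", " ").split()
        name, value = fields[0], int(fields[1])
        if section == "Allocatable:":
            allocatable[name] = value
        else:
            requested[name] = value
    return {
        name: (total, total - requested.get(name, 0))
        for name, total in allocatable.items()
    }


if __name__ == "__main__":
    sample = """\
Capacity:
nvidia.com/gpu: 0
nvidia.com/nvidia-rtx-3090: 2
Allocatable:
nvidia.com/gpu: 0
nvidia.com/nvidia-rtx-3090: 2
pods: 110
System Info:
Allocated resources:
Resource Requests Limits
nvidia.com/gpu 0 0
nvidia.com/nvidia-rtx-3090 0 0
Events:
"""
    for resource, (total, free) in gpu_availability(sample).items():
        print(f"{resource}: allocatable={total} free={free}")
```

With the values pasted above, both GPUs show up as free under nvidia.com/nvidia-rtx-3090, while nvidia.com/gpu reports zero allocatable devices.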


lyon-v commented on September 24, 2024

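I may have misspoken: the GPU was not mounted the first time either, even though the GPUs are idle. The output above is the master and worker logs plus the details of node 46. There is no slave pod in the gpu-pool namespace.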
[root@yigou-dev-102-45 ~]# kubelet --version
Kubernetes v1.25.13
This is the Kubernetes version.


pokerfaceSad commented on September 24, 2024

There is a known issue on Kubernetes 1.20+: an ownerReference is not allowed to cross namespaces, so creating the slave pod fails.

See #19 (comment).
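
To make the namespace rule concrete, here is a small Python sketch of the constraint as I understand it from the Kubernetes garbage-collection documentation: a namespaced owner must live in the same namespace as its dependent, and only cluster-scoped objects can own across namespaces. ObjectRef and owner_reference_allowed are hypothetical names for illustration, not Kubernetes or GPUMounter APIs, and placing the slave pod in gpu-pool is an assumption taken from the namespace mentioned earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectRef:
    """Minimal stand-in for a Kubernetes object (hypothetical, for illustration)."""
    kind: str
    name: str
    namespace: Optional[str]  # None models a cluster-scoped object


def owner_reference_allowed(owner: ObjectRef, dependent: ObjectRef) -> bool:
    """Check the namespace rule for an ownerReference from dependent to owner."""
    # A cluster-scoped owner may own objects in any namespace.
    if owner.namespace is None:
        return True
    # A namespaced owner must share the dependent's namespace; this also
    # rejects a cluster-scoped dependent that points at a namespaced owner.
    return dependent.namespace == owner.namespace


if __name__ == "__main__":
    owner_pod = ObjectRef("Pod", "gpu-pod-1", "default")
    # Assumption: the slave pod is created in the gpu-pool namespace.
    slave_pod = ObjectRef("Pod", "gpu-pod-1-slave-pod-1d3148", "gpu-pool")
    print(owner_reference_allowed(owner_pod, slave_pod))
```

Under these assumptions the check fails for an owner pod in default and a slave pod in gpu-pool, which matches the failure described above.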


lyon-v commented on September 24, 2024

[root@yigou-dev-102-45 ~]# cat /etc/docker/daemon.json
{
  "data-root": "/var/lib/docker",
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "insecure-registries": [
    "registry.bitahub.com:5000",
    "registry.hub.com:5000",
    "docker-user.cambricon.com:30080",
    "10.10.8.100:5000",
    "10.11.3.8:5000",
    "112.31.12.176:5000",
    "10.12.4.35:5000",
    "10.0.0.12:5000"
  ],
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-file": "3",
    "max-size": "10m"
  },
  "max-concurrent-downloads": 20,
  "registry-mirrors": [
    "https://reg-mirror.qiniu.com/",
    "https://pqs5j944.mirror.aliyuncs.com",
    "https://7bezldxe.mirror.aliyuncs.com/",
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com",
    "https://docker.mirrors.ustc.edu.cn/"
  ],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
Above is my /etc/docker/daemon.json. I'll try changing the environment variable.
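
In case it helps others compare their own setup, here is a minimal Python sketch that checks a daemon.json like the one above for the two runtime settings it contains: default-runtime set to nvidia, and an nvidia entry registered under runtimes. The function name nvidia_is_default_runtime is hypothetical, and the sketch only reads the file contents. It does not verify that Docker has been restarted or that nvidia-container-runtime is actually installed.

```python
import json


def nvidia_is_default_runtime(daemon_json_text: str) -> bool:
    """Return True if the config names nvidia as default runtime and registers it.

    Hypothetical helper for illustration; it inspects only the JSON text.
    """
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return config.get("default-runtime") == "nvidia" and "nvidia" in runtimes


if __name__ == "__main__":
    sample = """
    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {"args": [], "path": "nvidia-container-runtime"}
      }
    }
    """
    print(nvidia_is_default_runtime(sample))
```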

