- 🔭 I’m currently working at Alibaba Cloud.
- 🌱 I’m currently learning Kubernetes controller development.
- 👯 I’m looking to collaborate on Go and Kubernetes development.
- 🤔 I’m looking for help with GoLand.
This project is forked from aliyuncontainerservice/gpushare-scheduler-extender.
GPU topology scheduler for Kubernetes clusters
License: Apache License 2.0
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect?detail=true
{"nodes":[null,null],"policy":"static"}
One question:
As the following two commands show, when a pod goes from Running to Completed, how is the release of its resources detected and monitored?
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect?detail=true
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":2,"topology":[[0,7,0,0,6,6,8,6],[7,0,0,0,6,6,6,8],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[6,6,0,0,0,8,7,8],[6,6,0,0,8,0,8,7],[8,6,0,0,7,8,0,7],[6,8,0,0,8,7,7,0]]},{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]}],"policy":"best_effort"}
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect?detail=true
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]},{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]}],"policy":"best_effort"}
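The difference between the two responses above can be checked from the client side. The following is a minimal sketch, not the scheduler's own release-detection logic (which this document does not show): it parses two inspect snapshots and reports how many GPUs each node freed. The struct fields follow the JSON shown above; `releasedGPUs` is a hypothetical helper name, and the sample payloads omit `topology` for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node mirrors one entry of the "nodes" array in the inspect output.
type Node struct {
	Name     string  `json:"name"`
	TotalGPU int     `json:"totalGPU"`
	UsedGPU  int     `json:"usedGPU"`
	Topology [][]int `json:"topology"`
}

// InspectResult mirrors the top-level inspect response.
type InspectResult struct {
	Nodes  []*Node `json:"nodes"`
	Policy string  `json:"policy"`
}

// releasedGPUs (hypothetical helper) returns, per node name, how many GPUs
// were freed between two snapshots. Nodes without a decrease are omitted.
func releasedGPUs(before, after InspectResult) map[string]int {
	used := make(map[string]int)
	for _, n := range before.Nodes {
		if n != nil { // a response may contain null entries
			used[n.Name] = n.UsedGPU
		}
	}
	freed := make(map[string]int)
	for _, n := range after.Nodes {
		if n == nil {
			continue
		}
		if prev, ok := used[n.Name]; ok && prev > n.UsedGPU {
			freed[n.Name] = prev - n.UsedGPU
		}
	}
	return freed
}

func main() {
	// Abbreviated copies of the two responses above (topology omitted).
	beforeJSON := `{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":2},{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0}],"policy":"best_effort"}`
	afterJSON := `{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":0},{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0}],"policy":"best_effort"}`

	var before, after InspectResult
	if err := json.Unmarshal([]byte(beforeJSON), &before); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(afterJSON), &after); err != nil {
		panic(err)
	}
	fmt.Println(releasedGPUs(before, after))
}
```

Polling the endpoint like this only observes the result of a release; it says nothing about whether the scheduler itself reacts to pod events or resyncs periodically.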
At first, inspect cannot retrieve information for some nodes:
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]}]}
However, once each node has been queried individually, inspect returns information for all nodes:
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]}]}
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect/cn-zhangjiakou.192.168.0.112
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0,"topology":[[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0]]}]}
Now inspect returns information for all nodes:
[root@iZ8vbazwei4j05nbediqaeZ lijj]# curl 127.0.0.1:12345/gputopology-scheduler/inspect
{"nodes":[{"name":"cn-zhangjiakou.192.168.0.113","totalGPU":8,"usedGPU":0,"topology":[[0,7,7,8,6,6,8,6],[7,0,8,7,6,6,6,8],[7,8,0,8,7,6,6,6],[8,7,8,0,6,7,6,6],[6,6,7,6,0,8,7,8],[6,6,6,7,8,0,8,7],[8,6,6,6,7,8,0,7],[6,8,6,6,8,7,7,0]]},{"name":"cn-zhangjiakou.192.168.0.112","totalGPU":8,"usedGPU":0,"topology":[[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0]]}]}
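In the response above, cn-zhangjiakou.192.168.0.112 is listed but its topology matrix contains only zeros. A small check like the following could flag such nodes when reading inspect output; `allZero` is a hypothetical helper, and treating an all-zero matrix as "topology not reported" is an assumption based on the outputs shown here.

```go
package main

import "fmt"

// allZero (hypothetical helper) reports whether a topology matrix contains
// only zeros. An empty matrix also counts as all zero.
func allZero(topology [][]int) bool {
	for _, row := range topology {
		for _, v := range row {
			if v != 0 {
				return false
			}
		}
	}
	return true
}

func main() {
	reported := [][]int{{0, 7}, {7, 0}} // 2x2 excerpt of a populated matrix
	missing := [][]int{{0, 0}, {0, 0}}  // 2x2 excerpt of an all-zero matrix
	fmt.Println(allZero(reported), allZero(missing)) // false true
}
```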
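Start a GPU topology cluster.
At this point, the GPU topology of every node can be retrieved normally.
After a machine learning application is deployed, usedGPU changes and the topology changes as well; this is expected.
After the application is deleted, usedGPU returns to its previous value, but the topology does not change back. The symptom looks like this: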
[root@iZ8vbazwei4j05nbediqaeZ inspect]# ./inspect -d
gpu policy: best_effort
----------------------------------------
node name: cn-zhangjiakou.192.168.0.113
all gpu: 8
used gpu: 0
gpu topolocy
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X X X NV2 SYS SYS NV2 SYS
GPU1 X X X X X X X X
GPU2 X X X X X X X X
GPU3 NV2 X X X SYS NV1 SYS SYS
GPU4 SYS X X SYS X NV2 NV1 NV2
GPU5 SYS X X NV1 NV2 X NV2 NV1
GPU6 NV2 X X SYS NV1 NV2 X NV1
GPU7 SYS X X SYS NV2 NV1 NV1 X
----------------------------------------
node name: cn-zhangjiakou.192.168.0.112
all gpu: 8
used gpu: 0
gpu topolocy
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NV1 NV1 NV2 SYS SYS NV2 SYS
GPU1 NV1 X NV2 NV1 SYS SYS SYS NV2
GPU2 NV1 NV2 X NV2 NV1 SYS SYS SYS
GPU3 NV2 NV1 NV2 X SYS NV1 SYS SYS
GPU4 SYS SYS NV1 SYS X NV2 NV1 NV2
GPU5 SYS SYS SYS NV1 NV2 X NV2 NV1
GPU6 NV2 SYS SYS SYS NV1 NV2 X NV1
GPU7 SYS NV2 SYS SYS NV2 NV1 NV1 X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
PSB = Connection traversing a single on-board PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
[root@iZ8vbazwei4j05nbediqaeZ inspect]#
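Comparing the JSON matrix for cn-zhangjiakou.192.168.0.112 with the table above suggests how the numeric values correspond to the printed labels: 0 appears as X, 6 as SYS, 7 as NV1, and 8 as NV2. The sketch below encodes that inferred mapping; it is an assumption drawn from these outputs rather than from the project's source, and `linkLabel` and `renderRow` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// linkLabel maps a numeric topology value to the label printed by the
// inspect tool. The mapping is inferred from the outputs in this document
// and covers only the values that appear there.
func linkLabel(v int) string {
	switch v {
	case 0:
		return "X"
	case 6:
		return "SYS"
	case 7:
		return "NV1"
	case 8:
		return "NV2"
	default:
		return fmt.Sprintf("?%d", v) // value not seen in the outputs above
	}
}

// renderRow formats one matrix row as a space-separated line of labels.
func renderRow(gpu int, row []int) string {
	labels := make([]string, len(row))
	for i, v := range row {
		labels[i] = linkLabel(v)
	}
	return fmt.Sprintf("GPU%d %s", gpu, strings.Join(labels, " "))
}

func main() {
	// First row of the matrix reported for cn-zhangjiakou.192.168.0.112.
	row0 := []int{0, 7, 7, 8, 6, 6, 8, 6}
	fmt.Println(renderRow(0, row0))
	// prints: GPU0 X NV1 NV1 NV2 SYS SYS NV2 SYS
}
```

Under this mapping, an off-diagonal X in the table for cn-zhangjiakou.192.168.0.113 corresponds to a 0 in the matrix, which is how the stale entries show up after the application is deleted.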
The device plugin fails to report the topology for all nodes.