tencent / caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

License: Other

Makefile 0.08% Go 97.82% Dockerfile 0.17% Shell 1.93%
containerd docker hadoop kubernetes runtime yarn

caelus's People

Contributors

chaosju, chenlingpeng, ddongchen, jiwq, mymneo, testwill, threestoneliu, vanient

caelus's Issues

Some Feedback

In PR #37, in the file pkg/caelus/predict/predict_local.go:
mem := math.Max(memStats.UsageRss-memStats.UsageTotal, 0)
mem is always 0. Could you please confirm whether this is expected?

lighthouse: make rpm fails

/caelus/contrib/lighthouse-plugin$ make rpm
./hack/rpm
Sending build context to Docker daemon 137.7kB
Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
make: *** [rpm] Error 1

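lighthouse and lighthouse-plugin report errors after deployment

Both lighthouse and lighthouse-plugin are deployed, and the relevant kubelet parameters have been changed, but startup still fails.

kubelet reports directly that it cannot get the Docker version.

The lighthouse process also reports errors: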

I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio

Question about metrics collection

tutorial.md states:
Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server.
However, the code does not appear to contain anything for "kernel metrics from eBPF".

lighthouse fails to run

systemctl status lighthouse.service
● lighthouse.service - Lighthouse server
Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2022-08-29 18:09:16 CST; 9min ago
Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
Main PID: 57742 (code=exited, status=255)

Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
Aug 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "lighthouse.io/v1alpha1" in scheme "pkg/runtime/scheme.go:101"
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
Aug 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.

ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory

Hi, I would like to ask for help. When I deploy caelus on k8s, the caelus pods show the following logs:
I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

Did I miss a step?

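Where did the offline scheduler go?

Where is the offline scheduler? Do new hardware resource types such as colocation/cpu and colocation/memory need to be registered through a kubelet device plugin?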

A question about the top-level offline cgroup

lighthouse's func (p *offlineMutator) mutate already intercepts the request and executes:

newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
newSplits = append(newSplits, splits[2:]...)
newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
newCgroupParent = "/" + newCgroupParent
containerConfig.InnerHostConfig.CgroupParent = newCgroupParent

After this interception, these tasks should all sit under the top-level offline cgroup parent directory. So why is moveOfflinePidsTogether in qos_k8s.go still needed? Is moveOfflinePidsTogether redundant here?

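Incomplete README

I have recently started working on colocation scheduling based on Caelus, but after reading through the code I am still confused. I have a few questions:

  1. Judging from how the code uses it, does yarn have to run on K8S?
  2. Coordination such as Predict currently works only on a single node. Is there no resource scheduling across the whole cluster?
  3. I hope a complete, beginner-friendly README can be provided. Many thanks.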

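Does lighthouse support the gRPC protocol?

Judging from the currently open-sourced lighthouse, it supports only the HTTP protocol. If docker is later replaced with containerd and kubelet sends gRPC CRI requests directly to containerd, will lighthouse still work? Thanks.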

hadoop version problem

I tried to deploy nm-operator with my Hadoop cluster, version 2.6, but found that the yarn client does not work with the yarn rmadmin -updateNodeResource command. How can I use nm-operator with a lower version of Hadoop?

/help

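Traffic limiting for offline jobs

Reading the caelus source code, I found that pkg/caelus/qos/manager/netio/netio.go limits traffic by invoking the Linux tc command. As I understand it, Linux TC can only limit outbound traffic. Does this mean caelus can currently limit only the outbound traffic of offline jobs?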
