tencent / caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

License: Other

Makefile 0.08% Go 97.82% Dockerfile 0.17% Shell 1.93%

containerd docker hadoop kubernetes runtime yarn

caelus's Issues

Can LinuxContainerExecutor be supported when the NM runs in Docker?

Thanks for your great work on this project, especially for elastic Yarn with K8S.

Sorry, I'm not familiar with K8S, so I'm unsure whether Hadoop's LinuxContainerExecutor can be supported natively when the NM runs in Docker.

If you have any ideas on this, please share them with me.

Fix the wrong PodSpec in nodemanager.yaml

As we know, the PodSpec's containers field is an array, but nodemanager.yaml has a typo there.

Some Feedback

In PR #37, in the file pkg/caelus/predict/predict_local.go:
mem := math.Max(memStats.UsageRss-memStats.UsageTotal, 0)
mem is always 0. Could you please confirm whether this is expected?

Is the lighthouse component required?

A question: if offline jobs run on yarn, not on k8s, is it fine to skip deploying lighthouse and the plugin server?

/caelus/contrib/lighthouse-plugin$ make rpm
./hack/rpm
Sending build context to Docker daemon 137.7kB
Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
make: *** [rpm] Error 1

Running the binary ./caelus --v="2" --kubeconfig=config cannot find the k8s node?

no kind "hookConfiguration"

no kind "hookConfiguration"
How does the hookConfiguration CRD get installed?

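Errors after deploying lighthouse and lighthouse-plugin

Both lighthouse and lighthouse-plugin are deployed and the relevant kubelet parameters have been changed, but startup still fails.

kubelet reports directly that it cannot get the docker version.

The lighthouse process also reports errors: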

I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio

How to use the lighthouse plugin

Do I change the kubelet startup parameter --container-runtime-endpoint to point it at the lighthouse plugin?

Question about metrics collection

tutorial.md states:
Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server.
However, the code does not seem to contain anything for the "kernel metrics from eBPF" item.

Writing files to /rootfs/etc inside the container reports: Read-only file system

caelus/pkg/caelus/diskquota/manager/projectquota/projectfile.go

Lines 50 to 53 in 27d65d5

    
    if !util.InHostNamespace {
        projectsPath = path.Join(types.RootFS, projectsPath)
        projidPath = path.Join(types.RootFS, projidPath)
    }

It looks like /etc/ is already mounted directly into the container, so is this still needed here?
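To illustrate what the quoted lines do, here is a minimal sketch of the path rewrite. The `/rootfs` prefix comes from the issue title; the concrete file paths `/etc/projects` and `/etc/projid` and the helper name `resolve` are assumptions for illustration, not values taken from the Caelus source.

```go
package main

import (
	"fmt"
	"path"
)

// rootFS is assumed to be the mount point of the host root inside the container.
const rootFS = "/rootfs"

// resolve is a hypothetical helper: it returns the path unchanged in the host
// namespace, and prefixes it with rootFS otherwise, like the quoted snippet.
func resolve(p string, inHostNamespace bool) string {
	if inHostNamespace {
		return p
	}
	return path.Join(rootFS, p)
}

func main() {
	// Assumed project quota file locations, for illustration only.
	for _, p := range []string{"/etc/projects", "/etc/projid"} {
		fmt.Println(resolve(p, true), "->", resolve(p, false))
	}
}
```

If the host root is mounted read-only at the prefix, writes to the rewritten paths fail, while a separate writable mount of /etc inside the container would be reachable through the unprefixed paths.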

lighthouse fails to run

systemctl status lighthouse.service
● lighthouse.service - Lighthouse server
Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2022-08-29 18:09:16 CST; 9min ago
Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
Main PID: 57742 (code=exited, status=255)

Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
Aug 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "lighthouse.io/v1alpha1" in scheme "pkg/runtime/scheme.go:101"
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
Aug 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.

ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory

Hi, I would like to ask for help. When I deploy caelus on k8s, the caelus pods show the following logs:
I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

Did I miss some steps?
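Reading the log above, the cpu.offline line is logged at error level (E) and the process continues by choosing the quota policy; the fatal line (F) is the missing /rootfs/data/cpu_manager_state file. The fallback can be sketched as follows. This is a rough illustration based only on the log text: the helper name `choosePolicy` and the policy name `"bt"` are assumptions, while `"quota"` is the name printed in the log.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// choosePolicy is a hypothetical helper: it picks "bt" (assumed name) when the
// cpu.offline file exists under the given cpu cgroup root, and "quota" otherwise.
func choosePolicy(cpuCgroupRoot string) string {
	if _, err := os.Stat(filepath.Join(cpuCgroupRoot, "cpu.offline")); err != nil {
		// File missing or unreadable: fall back, as the log line suggests.
		return "quota"
	}
	return "bt"
}

func main() {
	fmt.Println(choosePolicy("/sys/fs/cgroup/cpu"))
}
```

On a kernel without the cpu.offline file this falls back to quota, so the missing file alone would not explain the exit.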

Is the implementation of the interference detection part open source?

Add support for building caelus in a docker container

Add support for building caelus in a docker container, so that we can build it anywhere :)

/help

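Where did the offline scheduler go?

Where did the offline scheduler go? Do new hardware resource types such as colocation/cpu and colocation/memory need to be registered through the kubelet device plugin?

Is there a plan to open-source the offline scheduler? If so, roughly when?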

A question about the offline parent cgroup (the offline "big box")

Given that lighthouse's func (p *offlineMutator) mutate intercepts the request and executes:

newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
newSplits = append(newSplits, splits[2:]...)
newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
newCgroupParent = "/" + newCgroupParent
containerConfig.InnerHostConfig.CgroupParent = newCgroupParent

After this interception, these tasks should all sit under the offline parent cgroup directory. So why is moveOfflinePidsTogether in qos_k8s.go still needed? Is moveOfflinePidsTogether redundant here?
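The rewrite in the quoted snippet can be sketched in a self-contained form. This assumes that `splits` is the original cgroup parent split on the path separator, so that `splits[0]` is empty for an absolute path; the function name `insertOffline` and the sample paths are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const offlineKey = "offline"

// insertOffline is a hypothetical helper that places the offline key right
// after the first path component, mirroring the quoted append/join steps.
func insertOffline(cgroupParent string) string {
	splits := strings.Split(cgroupParent, "/")
	if len(splits) < 2 {
		// Not an absolute path with at least one component; leave it unchanged.
		return cgroupParent
	}
	newSplits := []string{splits[1], offlineKey}
	newSplits = append(newSplits, splits[2:]...)
	return "/" + strings.Join(newSplits, "/")
}

func main() {
	// Sample input is an assumed cgroup parent, for illustration only.
	fmt.Println(insertOffline("/kubepods/burstable/pod123"))
}
```

Under that assumption, a parent such as /kubepods/burstable/pod123 becomes /kubepods/offline/burstable/pod123, which is the placement the question refers to.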

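Incomplete README

I have recently started working on co-location scheduling based on Caelus, but after reading through the code I am still confused. A few questions:

Judging from how the code uses it, does yarn have to run on K8S?
Coordination such as Predict currently seems to work on a single node only. Is there no resource scheduling across the whole cluster?
I hope a complete, foolproof README can be provided. Many thanks.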
