
kubernetes-retired / poseidon


[EOL] A Firmament-based Kubernetes scheduler

Home Page: http://www.firmament.io

License: Apache License 2.0

Shell 30.32% Go 57.95% Makefile 6.19% Python 1.57% Dockerfile 1.36% Starlark 2.60%
k8s-sig-scheduling

poseidon's Introduction


Introduction

The Poseidon/Firmament scheduler incubation project brings the integration of the Firmament scheduler (OSDI paper) to Kubernetes. At a very high level, Poseidon/Firmament augments the current Kubernetes scheduling capabilities by adding novel flow-network-graph-based scheduling alongside the default Kubernetes scheduler. Firmament models workloads on a cluster as flow networks and runs min-cost flow optimizations over these networks to make scheduling decisions.
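For intuition, the optimization Firmament solves in each scheduling round is an instance of the standard minimum-cost flow problem. The graph construction and cost model are Firmament-specific; the generic formulation, given here only for orientation, is:

    \min_{f} \sum_{(u,v) \in E} c_{uv} \, f_{uv}
    \quad \text{s.t.} \quad 0 \le f_{uv} \le \mathrm{cap}_{uv} \;\; \forall (u,v) \in E,
    \qquad \sum_{v} f_{uv} - \sum_{v} f_{vu} = b_u \;\; \forall u \in V

where tasks act as flow sources (b_u > 0), machines and a sink node absorb the flow, arc costs c_{uv} encode placement preferences, and the arcs carrying flow in the optimal solution determine the task placements.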

Thanks to its inherent rescheduling capabilities, the new scheduler enables globally optimal scheduling for a given policy and keeps refining the dynamic placement of the workload.

As part of Kubernetes' support for multiple schedulers, each new pod is typically scheduled by the default scheduler. However, Kubernetes can be instructed to use another scheduler by specifying the name of a custom scheduler at pod deployment time (in our case, by setting 'schedulerName' to poseidon in the pod template). In that case, the default scheduler ignores the pod and lets the Poseidon scheduler place it on a suitable node. A minimal example is shown below.
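To illustrate, a minimal pod spec that opts into Poseidon could look like the following (a sketch only; the pod name, image, and resource requests are placeholders, and it assumes the scheduler is deployed under the name 'poseidon'):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-via-poseidon      # hypothetical example pod
spec:
  schedulerName: poseidon       # hand this pod to Poseidon; the default scheduler will ignore it
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 100m
        memory: 128Mi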

Key Advantages

  • Flow graph scheduling provides the following:

    • Support for high-volume workload placement.
    • Complex rule constraints.
    • Globally optimal scheduling for a given policy.
    • Extremely high scalability.

    NOTE: It is also important to highlight that Firmament scales much better than the default scheduler as the number of nodes in a cluster increases.

Current Project Stage

Alpha Release

Design

Poseidon/Firmament Integration architecture

For more details about the design of this project see the design document.

Installation

For in-cluster installation of Poseidon, please start here.

Development

For development instructions, please refer here.

Release Process

For details of the coordinated release process between the Firmament and Poseidon repos, refer here.

Latest Benchmarking Results

Please refer to this link for detailed throughput performance comparison results between the Poseidon/Firmament scheduler and the Kubernetes default scheduler.

Roadmap

  • Release 0.9 onwards:
    • Provide High Availability/Failover for in-memory Firmament/Poseidon processes.
    • Scheduling support for “Dynamic Persistence Volume Provisioning”.
    • Optimizations to reduce the number of arcs by limiting the number of eligible nodes in a cluster.
    • CPU/Memory combination optimizations.
    • Transitioning to the Metrics Server API – upstreaming our new Heapster sink is no longer an option, as Heapster is being deprecated.
    • Continuous running scheduling loop versus scheduling intervals mechanism.
    • Priority preemption support.
    • Priority based scheduling.
  • Release 0.8 – Target Date 15th February, 2019:
    • Pod Affinity/Anti-Affinity optimization in 'Firmament' code.
  • Release 0.7 – Target Date 19th November, 2018:
    • Support for Max. Pods per Node.
    • Co-Existence with Default Scheduler.
    • Node Prefer/Avoid pods priority function.
  • Release 0.6 – Target Date 12th November, 2018:
    • Gang Scheduling.
  • Release 0.5 – Released on 25th October 2018:
    • Support for Ephemeral Storage, in addition to CPU/Memory.
    • Implementation for Success/Failure of scheduling events.
    • Scheduling support for “Pre-bound Persistence Volume Provisioning”.
  • Release 0.4 – Released on 18th August, 2018:
    • Taints & Tolerations.
    • Support for Pod anti-affinity symmetry.
    • Throughput Performance Optimizations.
  • Release 0.3 – Released on 21st June, 2018:
    • Pod level Affinity and Anti-Affinity implementation using multi-round scheduling based affinity and anti-affinity.
  • Release 0.2 – Released on 27th May, 2018:
    • Node level Affinity and Anti-Affinity implementation.
  • Release 0.1 – Released on 3rd May, 2018:
    • Baseline Poseidon/Firmament scheduling capabilities using the new multi-dimensional CPU/Memory cost model are part of this release. This does not yet include node- and pod-level affinity/anti-affinity capabilities; those are being built out in the releases listed above.
    • All test-infra bot automation jobs are in place as part of this release.

poseidon's People

Contributors

agilebot1, ant-caichu, asifdxtreme, deepak-vij, eissana, hanxiaoshuai, icgog, islinwb, jiaxuanzhou, k8s-ci-robot, karunchennuri, kevin-wangzefeng, ms705, nikita15p, shivramsrivastava, spiffxp, stewart-yu, timothysc, wgliang


poseidon's Issues

dev list about e2e

There is one piece of work about e2e that needs to be done:

  1. As PR #97 mentioned, we should generate the $HOME/.kube/config content rather than copy it.
    ....

Local Cluster E2E test error

When I use test/e2e-poseidon-local.sh to run the local cluster E2E test, it returns a panic. I want to know what should be defined in the kubeconfig file /root/.kube/config?

May 16 20:29:12.000: INFO: Location of the kubeconfig file /root/.kube/config
=== RUN   TestPoseidon
Running Suite: Poseidon Suite
=============================
Random Seed: 1526473751
Will run 7 of 7 specs

Panic [0.001 seconds]
[BeforeSuite] BeforeSuite
/home/gopath/src/github.com/kubernetes-sigs/poseidon/test/e2e/framework/framework.go:87

  Test Panicked
  invalid configuration: no configuration has been provided
  /home/go/src/runtime/panic.go:502

  Full Stack Trace
        /home/go/src/runtime/panic.go:502 +0x229
  github.com/kubernetes-sigs/poseidon/test/e2e/framework.(*Framework).BeforeEach(0xc4202b84b0)
        /home/gopath/src/github.com/kubernetes-sigs/poseidon/test/e2e/framework/framework.go:100 +0x509
  github.com/kubernetes-sigs/poseidon/test/e2e/framework.(*Framework).BeforeEach-fm()
        /home/gopath/src/github.com/kubernetes-sigs/poseidon/test/e2e/framework/framework.go:87 +0x2a
  github.com/kubernetes-sigs/poseidon/vendor/github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0xc4200935c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/gopath/src/github.com/kubernetes-sigs/poseidon/test/e2e/poseidon_suite_test.go:28 +0x64
  testing.tRunner(0xc4204d40f0, 0x1103040)
        /home/go/src/testing/testing.go:777 +0xd0
  created by testing.(*T).Run
        /home/go/src/testing/testing.go:824 +0x2e0

k8s_api_client.cc: Exception while waiting for pod list: not an object

Hello, I pulled your camsas/poseidon:dev docker image and wanted to give it a try. I used the following command:

$ docker run camsas/poseidon:dev /usr/bin/poseidon \
    --logtostderr \
    --k8s_apiserver_host=192.168.3.89 \
    --k8s_apiserver_port=8080 \
    --cs2_binary=/usr/bin/cs2.exe \
    --max_tasks_per_pu=50

But then an exception occurred; here are the logs:

I0521 05:59:45.096031     1 scheduler_bridge.cc:44] Firmament scheduler instantiated: 0x1b56650
I0521 05:59:45.096866     1 k8s_api_client.cc:63] Starting K8sApiClient for API server at http://192.168.3.89:8080/
I0521 05:59:45.187544     1 scheduler_bridge.cc:86] Adding new node's resource with RID f4673222-7577-ed4f-a0ec-d5a5c1da1c5d
E0521 05:59:45.262907     1 k8s_api_client.cc:313] Exception while waiting for pod list: not an object
...

And here are my pods in my k8s cluster:

$ kubectl get pod 
NAME                      READY     STATUS    RESTARTS   AGE
frontend-0d71m            1/1       Running   0          12d
frontend-75xjr            1/1       Running   0          12d
frontend-b54c8            1/1       Running   0          12d
redis-master-3wtcq        1/1       Running   0          12d
redis-slave-3fctv         1/1       Running   0          12d
redis-slave-gm9kk         1/1       Running   0          12d

I looked into Poseidon's source code, and it seems that the APIClient couldn't get the cluster's pods:
Poll pods
Also, my OS is CentOS 7.2, not the suggested Ubuntu 14.04... Does this matter?
What should I do to solve this problem?
Thanks~

poseidon should cache the node of the cluster

1. Cache the node objects to generate a resource view of the nodes in the cluster.
2. Optional but useful: provide an API to get the overall resource usage of the cluster.

Create the official release for poseidon

With the rapid development of the project, it is necessary for us to periodically publish Poseidon releases. So I propose:

  • release cycle: monthly?

  • release content:

    • zip and tar packages of source code.
    • a docker image link, e.g. gcr.io/google_containers/poseidon-{ARCH}:{Version}. Supported ARCHs are amd64, arm, arm64, ppc64le, s390x, and the release version is in the format x.y.z.
    • a release tar package which consists of poseidon deployment manifests for kubernetes and a saved docker image file.
    • a brief description of notable changes.

@ms705 @ICGog @deepak-vij @shashidharatd @shivramsrivastava @dhilipkumars

Firmament scheduler throws error when run with 1.7 kubernetes

I am facing a strange issue when using the firmament container.

Observations:
Kubernetes version : 1.7

Firmament created using:

docker run  --net=host  camsas/firmament:dev /firmament/build/src/firmament_scheduler --flagfile=/firmament/default.conf 

Poseidon is able to connect to Firmament and Kubernetes.

Poseidon created using:

./poseidon --logtostderr --kubeConfig=<kubeconfig.cfg> --firmamentAddress=localhost:9090.

In standard out, I can see that both nodes and pods are being successfully watched and that tasks are being submitted to the Firmament scheduler:

I0903 22:37:33.695642   20401 nodewatcher.go:132] enqueueNodeAdition: Added node sample-node-2gjww

I0903 22:37:33.700437   20401 podwatcher.go:176] enqueuePodAddition: Added pod {podname4 test}
I0903 22:37:33.701334   20401 podwatcher.go:176] enqueuePodAddition: Added pod {podname5 test}

In firmament stdout, I see this error:

I0904 02:37:42.510321    13 utils.cc:341] External execution of command: build/third_party/flowlessly/src/flowlessly-build/flow_scheduler --graph_has_node_types=true --algorithm=successive_shortest_path --print_assignments=true --daemon=false caling_factor= 
W0904 02:37:42.510821    15 solver_dispatcher.cc:143] STDERR from solver: E0904 02:37:42.510514    14 utils.cc:381] execvp failed for task command 'build/third_party/flowlessly/src/flowlessly-build/flow_scheduler --graph_has_node_types=true --algorithm=successive_shortest_path --print_assignments=true --daemon=false caling_factor= ': No such file or directory [2]

Result:
Pods are not being scheduled at all.

Just wanted to know if I missed anything.

Road map for E2E test in Poseidon

I have created a Google doc here for discussing the road map for the Poseidon E2E tests.
Please comment on the Google doc or in this issue directly.
If you feel it needs more additions, please add those sections to the Google doc.
This list is not exhaustive, but it covers the things I feel we need to add as part of the E2E tests.
We can refine it and come up with a final task list.

@deepak-vij @shashidharatd @m1093782566

Support for Gang scheduling

Enable support for gang scheduling within the Firmament scheduler. Some jobs cannot make progress unless all their tasks are running (for example, a synchronized iterative graph computation), while others can begin processing even as tasks are scheduled incrementally (e.g., a MapReduce job).

Create a SECURITY_CONTACTS file.

As per the email sent to kubernetes-dev[1], please create a SECURITY_CONTACTS
file.

The template for the file can be found in the kubernetes-template repository[2].
A description for the file is in the steering-committee docs[3], you might need
to search that page for "Security Contacts".

Please feel free to ping me on the PR when you make it, otherwise I will see when
you close this issue. :)

Thanks so much, let me know if you have any questions.

(This issue was generated from a tool, apologies for any weirdness.)

[1] https://groups.google.com/forum/#!topic/kubernetes-dev/codeiIoQ6QE
[2] https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS
[3] https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance-template-short.md

Poseidon pod stuck in "Init:0/1"

I followed the guide in https://github.com/kubernetes-sigs/poseidon/blob/master/docs/install/README.md and tried it in two VMs. The poseidon pod is stuck in "Init:0/1" in one of the VMs.

docker logs <init-firmamentservice_container_ID>:

waiting for firmamentservice
nslookup: can't resolve 'firmament-service.kube-system'
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

docker exec -it <init-firmamentservice_container_ID> sh

[ root@poseidon-6d45696849-gxtpg:/ ]$ nslookup firmament-service.kube-system
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'firmament-service.kube-system'

[ root@poseidon-6d45696849-gxtpg:/ ]$ cat /etc/resolv.conf
nameserver 10.0.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local huawei.com
options ndots:5

[ root@poseidon-6d45696849-gxtpg:/ ]$ nslookup firmament-service.kube-system.svc.cluster.local
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      firmament-service.kube-system.svc.cluster.local
Address 1: 10.0.0.160 firmament-service.kube-system.svc.cluster.local

After changing firmament-service.kube-system to firmament-service.kube-system.svc.cluster.local in https://raw.githubusercontent.com/kubernetes-sigs/poseidon/master/deploy/poseidon-deployment.yaml, it works.

initContainers:
      - name: init-firmamentservice
        image: radial/busyboxplus:curl
        command: ['sh', '-c', 'until nslookup firmament-service.kube-system; do echo waiting for firmamentservice; sleep 1; done;']

Earlier I suspected it was caused by the huawei.com entry in /etc/resolv.conf, which comes from the node's /etc/resolv.conf. But even after commenting out 'search huawei.com' in the node's /etc/resolv.conf, nslookup of firmament-service.kube-system still failed.
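For reference, a sketch of the reporter's workaround applied to the deployment's init container (switching to the fully qualified service name; everything else unchanged):

initContainers:
      - name: init-firmamentservice
        image: radial/busyboxplus:curl
        # fully qualified name, so resolution does not depend on the DNS search path
        command: ['sh', '-c', 'until nslookup firmament-service.kube-system.svc.cluster.local; do echo waiting for firmamentservice; sleep 1; done;']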

The heapster-poseidon image is only available for x86 arch

I am trying to run poseidon on PowerPC machines. However, I could not find the source code of the shivramsrivastava/heapster-poseidon image, so I cannot create an image for the PowerPC arch.

In this case, could you please share the modified heapster source (heapster-poseidon), or generate the image for ppc64le arch?

What is the difference between the shivramsrivastava/heapster-poseidon and a regular heapster image?

containerized poseidon: memory corruption

@ms705 Hi there!
Thank you so much for your amazing work (again). I've tried to bring up poseidon in docker, yet it seems to suffer a memory issue after running for 2-3 minutes.
Please kindly take a look at the log if possible. Thanks so much.

root@ubuntu:~# docker logs 0f0e6ab6837e
I1205 02:11:00.477520     1 k8s_api_client.cc:47] Starting K8sApiClient for API server at http://10.10.103.67:8080/
I1205 02:11:00.481966     1 scheduler_integration.cc:134] Firmament scheduler instantiated: <FlowScheduler for coordinator >
I1205 02:11:00.526311     1 scheduler_integration.cc:145] Adding new node's resource with RID cf742c5b-2ef6-4b18-a93d-3785f5444d22
I1205 02:11:00.547468     1 scheduler_integration.cc:171] New unscheduled pod: busybox
I1205 02:11:00.548526     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:00.662428     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:00.663835     1 scheduler_integration.cc:191] Delta: task_id: 6569851536870938357
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:00.672199    40 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":409,"details":{"kind":"pods/binding","name":"busybox"},"kind":"Status","message":"Operation cannot be fulfilled on pods/binding \"busybox\": pod busybox is already assigned to node \"10.10.103.69\"","metadata":{},"reason":"Conflict","status":"Failure"}
I1205 02:11:00.673027     1 k8s_api_client.cc:201] Bound busybox to 10.10.103.67
I1205 02:11:10.695752     1 scheduler_integration.cc:171] New unscheduled pod: k8s-master-10.10.103.67
I1205 02:11:10.696748     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:10.802476     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:10.803570     1 scheduler_integration.cc:191] Delta: task_id: 8144958691656263322
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:10.812801    46 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"k8s-master-10.10.103.67"},"kind":"Status","message":"pods \"k8s-master-10.10.103.67\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:11:10.814126     1 k8s_api_client.cc:201] Bound k8s-master-10.10.103.67 to 10.10.103.67
I1205 02:11:20.839632     1 scheduler_integration.cc:171] New unscheduled pod: k8s-proxy-v1-7kvf0
I1205 02:11:20.840548     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:20.952191     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:20.952741     1 scheduler_integration.cc:191] Delta: task_id: 7995058044728560469
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:20.964922    31 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"k8s-proxy-v1-7kvf0"},"kind":"Status","message":"pods \"k8s-proxy-v1-7kvf0\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:11:20.965868     1 k8s_api_client.cc:201] Bound k8s-proxy-v1-7kvf0 to 10.10.103.67
I1205 02:11:30.994323     1 scheduler_integration.cc:171] New unscheduled pod: k8s-proxy-v1-eahvt
I1205 02:11:30.995163     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:31.108824     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:31.109328     1 scheduler_integration.cc:191] Delta: task_id: 13663010032745704128
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:31.118139    34 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"k8s-proxy-v1-eahvt"},"kind":"Status","message":"pods \"k8s-proxy-v1-eahvt\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:11:31.118768     1 k8s_api_client.cc:201] Bound k8s-proxy-v1-eahvt to 10.10.103.67
I1205 02:11:41.136282     1 scheduler_integration.cc:171] New unscheduled pod: k8s-proxy-v1-qce6o
I1205 02:11:41.136919     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:41.192404     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:41.193066     1 scheduler_integration.cc:191] Delta: task_id: 18392520999753559797
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:41.199827    34 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"k8s-proxy-v1-qce6o"},"kind":"Status","message":"pods \"k8s-proxy-v1-qce6o\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:11:41.199950     1 k8s_api_client.cc:201] Bound k8s-proxy-v1-qce6o to 10.10.103.67
I1205 02:11:51.220187     1 scheduler_integration.cc:171] New unscheduled pod: kube-addon-manager-10.10.103.67
I1205 02:11:51.220731     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:11:51.331187     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:11:51.331946     1 scheduler_integration.cc:191] Delta: task_id: 7867583654750718473
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:11:51.339476    41 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"kube-addon-manager-10.10.103.67"},"kind":"Status","message":"pods \"kube-addon-manager-10.10.103.67\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:11:51.339663     1 k8s_api_client.cc:201] Bound kube-addon-manager-10.10.103.67 to 10.10.103.67
I1205 02:12:01.368042     1 scheduler_integration.cc:171] New unscheduled pod: kube-dns-v20-o4p9l
I1205 02:12:01.369421     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
I1205 02:12:01.494426     1 scheduler_integration.cc:189] Got 1 scheduling deltas
I1205 02:12:01.496582     1 scheduler_integration.cc:191] Delta: task_id: 829239843783619468
resource_id: "cf742c5b-2ef6-4b18-a93d-3785f5444d22"
type: PLACE
I1205 02:12:01.509184    45 k8s_api_client.cc:72] Parsing binding response: {"apiVersion":"v1","code":404,"details":{"kind":"pods","name":"kube-dns-v20-o4p9l"},"kind":"Status","message":"pods \"kube-dns-v20-o4p9l\" not found","metadata":{},"reason":"NotFound","status":"Failure"}
I1205 02:12:01.509935     1 k8s_api_client.cc:201] Bound kube-dns-v20-o4p9l to 10.10.103.67
I1205 02:12:11.530864     1 scheduler_integration.cc:171] New unscheduled pod: kubernetes-dashboard-v1.4.0-cblag
I1205 02:12:11.531860     1 utils.cc:324] External execution of command: /usr/bin/cs2.exe 
*** Error in `/usr/bin/poseidon': malloc(): memory corruption: 0x00007f5a578bd010 ***

Add support for max pod number hard requirement

User story:

If node A can only run 10 pods and already has 10 pods running on it, then no new pod can be scheduled to node A until the number of pods running on it drops below 10.

NOTE: can we resolve this via the CPU cost model?
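For context, the per-node pod limit this user story refers to is normally configured on the kubelet; a minimal sketch of the corresponding KubeletConfiguration fragment (the value 10 matches the example above and is illustrative only):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# the node will not admit more than 10 pods; a scheduler must treat this as a hard constraint
maxPods: 10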

State where we are and revise the roadmap to help the project move forward better

As discussed yesterday, to help the project move forward better, we shall:

1. State where we are (with details)

  • Add a description of the project's development stage -- I'd say alpha or pre-alpha,
    since we do not yet pass all scheduling conformance tests.

  • Figure out how far behind the default scheduler we are, functionality-wise.
    The best way to do this is to run the scheduling e2e tests from the core repo (with arg --ginkgo.focus="sig-scheduling")
    and see how many tests fail.

  • Add a CI status indicator.
    Just a basic thing every project should do, to better track project health.

2. Revise roadmap and better backlog grooming

  • It's almost mid-May now, while some of the items that were supposed to be done per the initial roadmap are still not implemented.
  • There are no milestones set, nor any priority labels on issues; we should add them.

3. Do other things... that help improve the contributor experience

Comments are always welcome.

[k8s Incubation] Requesting project transfer

Hi @ms705 and @ICGog,

As discussed and agreed over email, could you please initiate the transfer to this new GitHub organization and to Tim? This will be the first step of incubating Poseidon into the Kubernetes ecosystem.

Approvers (people with commit access): @ms705 @ICGog @shivramsrivastava
Reviewers (people who can lgtm PRs): @dhilipkumars [two more to be added post-incubation]

cc: @timothysc @shivramsrivastava @deepak-vij

Hi @timothysc,

Shouldn't all of us be added to the kubernetes-sigs org so we can effectively manage the project? Could we do anything to make that happen, such as sending an email to someone with the list of members to be added?

Also, do you think it would make sense to create two GitHub groups in the kubernetes-sigs organization, such that:

  1. poseidon-approvers
  2. poseidon-reviewers

Regards,
Dhilip

PS: We could keep this issue open for discussing other logistics-related problems to bring this task to a close. WDYT?

Unfriendly import packages of github.com/kubernetes-sigs/poseidon/pkg

Just like:
https://github.com/kubernetes-sigs/poseidon/blob/master/docs/devel/README.md

Developers need to create a new k8s.io directory under $GOPATH, but usually developers will directly fork poseidon to their $GOPATH/src/github.com/username/poseidon. If there is no k8s.io directory, go build will fail. And if a developer modifies the code under github.com/kubernetes-sigs/poseidon/pkg, the changed code is not the one imported.

I suggest importing kubernetes-sigs/poseidon/pkg/k8sclient directly, or using something like Kubernetes' staging mechanism.

Cannot build poseidon

I have tried to build poseidon in several different ways and none of them has worked for me.

First, I tried "go build ." and got the following error:
can't load package: package k8s.io/poseidon: no Go files in /home/go/src/k8s.io/poseidon
I tried to google that, but couldn't fix it. Do you know what is missing?

Second, I just tried a simple "make"; then it does not find the Kubernetes packages:
can't load package: package k8s.io/kubernetes/hack/cmd/teststale: cannot find package "k8s.io/kubernetes/hack/cmd/teststale" in any of:
/home/go-src/src/k8s.io/kubernetes/hack/cmd/teststale (from $GOROOT)
/home/go/src/k8s.io/poseidon/_output/local/go/src/k8s.io/kubernetes/hack/cmd/teststale (from $GOPATH)
I found out that /home/go/src/k8s.io/poseidon/_output/local/go/src/k8s.io/kubernetes/ points to /home/go/src/k8s.io/poseidon/, not to the real kubernetes folder in the GOPATH.

Third, using Bazel: make bazel-build.
Then I get ERROR: infinite symlink expansion detected

Fourth, I manually created the symbolic link to the kubernetes folder in the GOPATH inside /home/go/src/k8s.io/poseidon/_output/local/go/src/k8s.io/.
It fixes the infinite symlink issue, but fails with:
ERROR: error loading package '_output/local/go/src/k8s.io/kubernetes/pkg/generated/openapi': Extension file not found. Unable to load package for '//pkg/generated/openapi:def.bzl': BUILD file not found on package path
ERROR: error loading package '_output/local/go/src/k8s.io/kubernetes/pkg/generated/openapi': Extension file not found. Unable to load package for '//pkg/generated/openapi:def.bzl': BUILD file not found on package path
INFO: Elapsed time: 1.077s
FAILED: Build did NOT complete successfully (418 packages loaded)
currently loading: _output/local/go/src/k8s.io/kubernetes/pkg/printers/internalversion ... (7 packages)

Could you please help me build Poseidon?

[Suggestion] add comment to more functions and critical code blocks

Recently I've been planning to do some secondary development on the Poseidon project, and I find that the project has few comments, which makes it harder to understand some complex functions or code blocks and can cause misunderstandings. Good comments make reading the source code more efficient, which could attract more community contributions. I would appreciate it if adding more complete comments could be put on the agenda.

Unexpected dense task placement for k8s cluster

Hi, I was running a containerized poseidon with camsas/poseidon:dev on the master node of a k8s cluster. Yet ever since it was properly brought up (or at least that's what I thought), it has been scheduling every pod onto the master node (10.10.103.67 in this case, as you can see in the log) while the other two worker nodes are left completely without workload. The three machines involved are homogeneous.

Note that the first busybox is scheduled by kube-scheduler.

I was wondering whether I should bring up another two poseidon instances to make things right (because it seems we should bring up a Firmament coordinator on every worker node of the cluster), or whether it's just some kind of coincidence and everything is already taken care of?

cc @ms705 Appreciate your help in advance.

root@ubuntu:~# kubectl get nodes
NAME           STATUS    AGE
10.10.103.67   Ready     4d
10.10.103.68   Ready     4d
10.10.103.69   Ready     4d
root@ubuntu:~# kubectl get pods -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP           NODE
busybox                         1/1       Running   96         4d        10.1.82.2    10.10.103.69
busybox-trident-basic           1/1       Running   25         1d        10.1.34.5    10.10.103.67
busybox-trident-basic-1         1/1       Running   0          38m       10.1.34.6    10.10.103.67
frontend-3223876880-1rl4q       1/1       Running   0          9m        10.1.34.7    10.10.103.67
frontend-3223876880-fof79       1/1       Running   0          9m        10.1.34.8    10.10.103.67
frontend-3223876880-rg3mi       1/1       Running   0          9m        10.1.34.9    10.10.103.67
redis-master-3804387969-ot2z9   1/1       Running   0          9m        10.1.34.10   10.10.103.67
redis-slave-368277221-nwkye     1/1       Running   0          9m        10.1.34.11   10.10.103.67
redis-slave-368277221-pgh1p     1/1       Running   0          9m        10.1.34.12   10.10.103.67

e2e failed in local-up-cluster

The E2E tests fail with local-up-cluster: I bring up a cluster with local-up-cluster and use test/e2e-poseidon-local.sh to run the E2E tests.

NAMESPACE       NAME                                   READY     STATUS    RESTARTS   AGE
kube-system     heapster-85994cc757-z7bcs              1/1       Running   0          24m
kube-system     kube-dns-659bc9899c-st6qz              3/3       Running   0          26m
poseidon-test   firmament-scheduler-7788b7d89b-g5hvx   1/1       Running   0          4m
poseidon-test   poseidon-5c4db977b7-555xh              1/1       Running   0          4m
poseidon-test   test-nginx-pod-2596996162              0/1       Pending   0          4m
Add Pod using Poseidon scheduler
 /var/paas/kaiyuan/src/github.com/kubernetes-sigs/poseidon/test/e2e/poseidon_integration.go:60
   using firmament for configuring pod
   /var/paas/kaiyuan/src/github.com/kubernetes-sigs/poseidon/test/e2e/poseidon_integration.go:62
     should succeed deploying pod using firmament scheduler [It]
     /var/paas/kaiyuan/src/github.com/kubernetes-sigs/poseidon/test/e2e/poseidon_integration.go:65

     Expected
         <string>: Pending
     to equal
         <string>: Running

There are some errors in kube-controller-manager.log:

I0517 14:13:33.385386   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)
I0517 14:13:33.391838   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)
I0517 14:13:33.403792   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)
I0517 14:13:33.426011   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)
I0517 14:13:33.467916   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)
I0517 14:13:33.552074   97129 endpoints_controller.go:375] Error syncing endpoints for service "poseidon-test/firmament-service", retrying. Error: Endpoints "firmament-service" is invalid: subsets[0].notReadyAddresses[0].ip: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8)

Question about flag content in Docker run command

Hello @ICGog @ms705 ,
I want to integrate Firmament with Kubernetes running locally; however, I got confused when I tried to run this command:
$ docker run camsas/poseidon:dev poseidon
--logtostderr
--kubeConfig=<path_kubeconfig_file>
--firmamentAddress=:
--statsServerAddress=:
--kubeVersion=<Major.Minor>

I don't know how to fill in path_kubeconfig_file and I don't know what the statsServerAddress means.
I will be grateful if somebody can help me!

when an e2e case finished, the pod was not deleted in time

kubectl get po --all-namespaces
NAMESPACE       NAME                                            READY     STATUS    RESTARTS   AGE
kube-system     kube-dns-659bc9899c-q2vdh                       3/3       Running   0          3h
poseidon-test   firmament-scheduler-7788b7d89b-9cwnx            1/1       Running   0          6m
poseidon-test   poseidon-86796db96-chslx                        1/1       Running   0          6m
poseidon-test   restricted-pod                                  0/1       Pending   0          28s
poseidon-test   test-nginx-deploy-4039455774-7bbd79f9c6-p2tx2   1/1       Running   0          6m
poseidon-test   test-nginx-deploy-4039455774-7bbd79f9c6-svn8x   1/1       Running   0          6m
poseidon-test   test-nginx-job-1879968118-dv7j6                 1/1       Running   0          5m
poseidon-test   test-nginx-job-1879968118-srtpq                 1/1       Running   0          5m
poseidon-test   test-nginx-rs-2854263694-czx47                  1/1       Running   0          6m
poseidon-test   test-nginx-rs-2854263694-ppfnw                  1/1       Running   0          6m
poseidon-test   test-nginx-rs-2854263694-v9rw5                  1/1       Running   0          6m
I0524 17:24:23.474194  107687 deployment_controller.go:573] Deployment poseidon-test/test-nginx-deploy-4039455774 has been deleted
I0524 17:24:23.479394  107687 deployment_controller.go:573] Deployment poseidon-test/test-nginx-deploy-4039455774 has been deleted
I0524 17:24:23.496520  107687 replica_set.go:477] Too few replicas for ReplicaSet poseidon-test/test-nginx-rs-2854263694, need 3, creating 3

Should add design document for node affinity

As we have almost finished the review work on the node selector and node affinity design for the Firmament/Poseidon scheduler, should we add the markdown document to this project? E.g., create a docs/design folder and drop the document in.

BTW, contributors should send a design document PR before starting feature development.
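For context, a standard node affinity stanza that such a design document needs to cover looks like this (a sketch; the label key/value and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity      # hypothetical example pod
spec:
  schedulerName: poseidon
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx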
