
Kubernetes Scheduler operator

The Kubernetes Scheduler operator manages and updates the Kubernetes scheduler deployed on top of OpenShift. The operator is based on the OpenShift library-go framework and is installed via the Cluster Version Operator (CVO).

It contains the following components:

  • Operator
  • Bootstrap manifest renderer
  • Installer based on static pods
  • Configuration observer

By default, the operator exposes Prometheus metrics via a metrics service. The metrics are collected from the following components:

  • Kubernetes Scheduler operator

Configuration

The configuration for the Kubernetes Scheduler is the result of merging:

  • a default config
  • an observed config (produced by the configuration observer mentioned above) taken from the spec of schedulers.config.openshift.io.

All of these are sparse configurations, i.e. unvalidated JSON snippets which are merged in order to form a valid configuration at the end.
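
For illustration, here is a hypothetical pair of sparse snippets and their merge (the leaderElection keys are borrowed from the rendered config shown later in this document):

default config (sparse):   {"leaderElection": {"leaderElect": true}}
observed config (sparse):  {"leaderElection": {"leaseDuration": "137s"}}
merged result:             {"leaderElection": {"leaderElect": true, "leaseDuration": "137s"}}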

Scheduling profiles

The following profiles are currently provided:

  • HighNodeUtilization
  • LowNodeUtilization
  • NoScoring

Each of these enables cluster-wide scheduling, configured via the Scheduler custom resource:

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: LowNodeUtilization
  ...
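
For example, to switch the profile without editing the full resource, a merge patch works (a sketch; any of the profile names above can be substituted):

$ oc patch scheduler cluster --type=merge -p '{"spec":{"profile":"HighNodeUtilization"}}'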

HighNodeUtilization

This profile disables the NodeResourcesBalancedAllocation plugin and the NodeResourcesFit plugin with the LeastAllocated type, and enables the NodeResourcesFit plugin with the MostAllocated type, favoring nodes that have a high allocation of resources. In the past the profile corresponded to disabling the NodeResourcesLeastAllocated and NodeResourcesBalancedAllocation plugins and enabling the NodeResourcesMostAllocated plugin.
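
For reference, the effect of this profile roughly corresponds to a KubeSchedulerConfiguration like the following (a hand-written sketch, not the operator's exact rendered output):

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated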

LowNodeUtilization

The default profile. It uses the default list of scheduling plugins as provided by the kube-scheduler.

NoScoring

This profile disables all scoring plugins.

Profile Customizations (TechnicalPreview)

Customizations of existing profiles are available under the .spec.profileCustomizations field:

Name                       Type    Description
dynamicResourceAllocation  string  Enable Dynamic Resource Allocation functionality

E.g.

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
  profileCustomizations:
    dynamicResourceAllocation: Enabled
  ...
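
To verify that a customization took effect, the rendered configuration can be inspected in the openshift-kube-scheduler/config configmap (a sketch using jsonpath):

$ oc get configmap config -n openshift-kube-scheduler -o jsonpath='{.data.config\.yaml}'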

Debugging

The operator also exposes events that can help with debugging issues. To get the operator events, run the following command:

$ oc get events -n openshift-kube-scheduler-operator

This operator is configured via the KubeScheduler custom resource:

$ oc describe kubescheduler
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  managementState: Managed
  ...

The log level of individual kube-scheduler instances can be increased by setting the .spec.logLevel field:

$ oc explain kubescheduler.spec.logLevel
KIND:     KubeScheduler
VERSION:  operator.openshift.io/v1

FIELD:    logLevel <string>

DESCRIPTION:
     logLevel is an intent based logging for an overall component. It does not
     give fine grained control, but it is a simple way to manage coarse grained
     logging choices that operators have to interpret for their operands. Valid
     values are: "Normal", "Debug", "Trace", "TraceAll". Defaults to "Normal".

For example:

apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  logLevel: Debug
  ...

Currently the log levels correspond to:

logLevel   log level
Normal     2
Debug      4
Trace      6
TraceAll   10
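
For example, to raise the level with a merge patch instead of editing the resource (a sketch):

$ oc patch kubescheduler cluster --type=merge -p '{"spec":{"logLevel":"Debug"}}'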

More about the individual configuration options can be learned by invoking oc explain:

$ oc explain kubescheduler

The current operator status is reported using the ClusterOperator resource. To get the current status, you can run the following command:

$ oc get clusteroperator/kube-scheduler
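
For example, to list just the reported conditions (a sketch using jsonpath):

$ oc get clusteroperator kube-scheduler -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'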

Developing and debugging the operator

In a running cluster, the cluster-version-operator is responsible for keeping the managed elements functioning and unaltered. To be able to use a custom operator image, one has to perform one of these operations:

  1. Set your operator to the unmanaged state (see the cluster-version-operator documentation for details), in short:
oc patch clusterversion/version --type='merge' -p "$(cat <<- EOF
spec:
  overrides:
  - group: apps
    kind: Deployment
    name: kube-scheduler-operator
    namespace: openshift-kube-scheduler-operator
    unmanaged: true
EOF
)"
  2. Scale down the cluster-version-operator:
oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version

IMPORTANT: This approach disables the cluster-version-operator completely, whereas the previous one only tells it not to manage the kube-scheduler-operator!

After doing this, you can change the image of the operator to the desired one:

oc patch pod/openshift-kube-scheduler-operator-<rand_digits> -n openshift-kube-scheduler-operator -p '{"spec":{"containers":[{"name":"kube-scheduler-operator-container","image":"<user>/cluster-kube-scheduler-operator"}]}}'
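
To confirm the override took effect, the image the operator pods are running can be checked (a sketch):

$ oc get pods -n openshift-kube-scheduler-operator -o jsonpath='{.items[*].spec.containers[*].image}'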

Developing and debugging the bootkube bootstrap phase

The operator image version used by the installer bootstrap phase can be overridden by creating a custom origin-release image pointing to the developer's operator :latest image:

$ IMAGE_ORG=<user> make images
$ docker push <user>/origin-cluster-kube-scheduler-operator

$ cd ../cluster-kube-apiserver-operator
$ IMAGES=cluster-kube-scheduler-operator IMAGE_ORG=<user> make origin-release
$ docker push <user>/origin-release:latest

$ cd ../installer
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=docker.io/<user>/origin-release:latest bin/openshift-install cluster ...

Profiling with pprof

Enable profiling

By default, kube-scheduler profiling is disabled. Profiling can be enabled manually by editing the config.yaml file on each master node.

Warning: this configuration is undone whenever a new revision is rolled out, and the steps need to be repeated.

Steps:

  1. access every master node (e.g. ssh or with oc debug)

    1. edit /etc/kubernetes/static-pod-resources/kube-scheduler-pod-$REV/configmaps/config/config.yaml (where $REV corresponds to the latest revision) and set the enableProfiling field to true.
    2. make a benign change to /etc/kubernetes/manifests/kube-scheduler-pod.yaml, e.g. updating "Waiting for port" to "Waiting for port " (adding one trailing blank space to the string). Wait for the updated pod manifest to be picked up and a new kube-scheduler instance to be running and ready.
  2. oc port-forward pod/$KUBE_SCHEDULER_POD_NAME 10259:10259 in a separate terminal/window (where $KUBE_SCHEDULER_POD_NAME corresponds to a running kube-scheduler pod instance)

  3. apply the following manifests to allow anonymous access:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: kubescheduler-anonymous-access
    rules:
    - nonResourceURLs: ["/debug", "/debug/*"]
      verbs:
      - get
      - list
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: kubescheduler-anonymous-access
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kubescheduler-anonymous-access
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: User
      name: system:anonymous
  4. access https://localhost:10259/debug/pprof/
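
With the port-forward and the anonymous-access binding above in place, the endpoints can also be fetched from the command line, e.g. a 30-second CPU profile (a sketch; these are the standard net/http/pprof endpoints):

$ curl -k "https://localhost:10259/debug/pprof/profile?seconds=30" -o cpu.prof
$ go tool pprof cpu.prof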

Heap profiling

The tool requires pulling the heap file and the kube-scheduler binary.

Steps:

  1. Pull the heap data by accessing https://localhost:10259/debug/pprof/heap
  2. Extract the kube-scheduler binary from the corresponding image (by checking the kube-scheduler pod manifest):
    $ podman pull --authfile $AUTHFILE $KUBE_SCHEDULER_IMAGE
    $ podman cp $(podman create --name kube-scheduler $KUBE_SCHEDULER_IMAGE):/usr/bin/kube-scheduler ./kube-scheduler
    • $AUTHFILE corresponds to your authentication file if not already located in the known paths
    • $KUBE_SCHEDULER_IMAGE corresponds to the kube-scheduler image found in a kube-scheduler pod manifest
  3. Run go tool pprof kube-scheduler heap
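
Putting the steps together (a sketch, reusing the port-forward and anonymous access from the previous section):

$ curl -k https://localhost:10259/debug/pprof/heap -o heap
$ go tool pprof kube-scheduler heap
(pprof) top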

Dumping kube-scheduler's node cache

From https://github.com/kubernetes/kubernetes/blob/be77b0b82b01a3fc810118f095594ec8bdd3c3aa/pkg/scheduler/internal/cache/debugger/debugger.go#L58:

// CacheDebugger provides ways to check and write cache information for debugging.
// ListenForSignal starts a goroutine that will trigger the CacheDebugger's
// behavior when the process receives SIGINT (Windows) or SIGUSER2 (non-Windows).

When a kube-scheduler process receives SIGUSR2, the node cache gets dumped into the logs. E.g.:

I0105 03:32:31.936642       1 dumper.go:52] "Dump of cached NodeInfo" nodes=<
    Node name: NODENAME1
    Deleted: false
    Requested Resources: ...
    Scheduled Pods(number: 41):
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 23c63c58-cc36-48be-97d9-f4f6088a709d, phase: Running, nominated node:
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 04b3b3b4-52a3-46d0-b7ff-aa748eecd404, phase: Running, nominated node:
    ...


    Node name: NODENAME2
    Deleted: false
    Requested Resources: ...
    Scheduled Pods(number: 53):
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 7cbce63f-3fb9-404a-a69b-6728592e6b2, phase: Running, nominated node:
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 50bc7d7e-bd30-4c47-82ce-a9d3eb737434, phase: Running, nominated node:
    ...
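
To trigger the dump, the signal can be sent from the node hosting the scheduler (a sketch; assumes pidof is available in the host namespace):

$ oc debug node/<master-node>
# chroot /host
# kill -USR2 $(pidof kube-scheduler)

Afterwards, inspect the kube-scheduler container logs for the "Dump of cached NodeInfo" lines.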


Issues

Error setting custom scheduler image with IMAGE env var

Trying to deploy a custom scheduler through the operator using the provided ${IMAGE} env var may fail due to ${IMAGE} being parsed for both the init container (wait-for-host-port) and the operand kube-scheduler.

In starter.go, the IMAGE env var is passed to NewTargetConfigController which uses it to replace the scheduler image in the pod yaml. However, the pod yaml uses ${IMAGE} in both the init container and the scheduler container.

What this means is that if I build my own custom scheduler image and specify it in the env var, the scheduler pod will fail to launch if my custom scheduler is built in a way that doesn't provide /usr/bin/timeout. Providing additional commands should not be a requirement for the operand (i.e., all that should be required is the kube-scheduler command, although the operator additionally assumes that hyperkube will also be provided; given that the scheduler is not built through hyperkube upstream anymore, this becomes problematic for users, so the command should also be customizable).

$ oc get all -n openshift-kube-scheduler
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-0   0/2     Init:CreateContainerError   0          32s
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-1   2/2     Running                     0          179m
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-2   2/2     Running                     1          3h

$ oc describe pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-0 -n openshift-kube-scheduler
...
Events:
  Type     Reason   Age   From                                         Message
  ----     ------   ----  ----                                         -------
  Normal   Pulling  40s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Pulling image "docker.io/mdame/kube-scheduler-119"
  Normal   Pulled   34s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Successfully pulled image "docker.io/mdame/kube-scheduler-119"
  Warning  Failed   34s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:00Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Warning  Failed  33s  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:01Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Warning  Failed  22s  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:12Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Normal   Pulled  9s (x3 over 33s)  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Container image "docker.io/mdame/kube-scheduler-119" already present on machine
  Warning  Failed  9s                kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:25Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"

As a user, I would like to be able to change the IMAGE and command for only the kube-scheduler container in order to run a custom scheduler image through the operator. However, this comes with the risk that users may run a broken scheduler and break their clusters. Given that the IMAGE env var isn't directly provided in the operator config, this seems sufficiently "hidden" as an advanced administrator/developer feature. It's also already provided, in all of our operators, but it's broken in this one.

With our operator only supporting deprecated Policy config, there is no other way for any user to customize the default scheduler.

/kind bug

Scheduler config file apiVersion is deprecated

Noticing the following issue:

WARNING: the provided config file is an unsupported apiVersion ("componentconfig/v1alpha1"), which will be removed in future releases

WARNING: switch to command-line flags or update your config file apiVersion to "kubescheduler.config.k8s.io/v1alpha1"

WARNING: apiVersions at alpha-level are not guaranteed to be supported in future releases

Discrepancy in operator management state between kubectl patch vs targetconfigcontroller

When I wanted to increase the log level of kube-scheduler, I noticed this.

In targetconfigcontroller, I see it accepts Managed, Unmanaged and Removed:

switch operatorSpec.ManagementState {
case operatorv1.Managed:
case operatorv1.Unmanaged:
	return nil
case operatorv1.Removed:
	// TODO probably just fail
	return nil
default:
	c.eventRecorder.Warningf("ManagementStateUnknown", "Unrecognized operator management state %q", operatorSpec.ManagementState)
	return nil
}

While kubectl edit accepts only Force and Managed.

kubeschedulers.operator.openshift.io "cluster" was not valid:

* spec.managementState: Invalid value: "": spec.managementState in body should match '^(Managed|Force)$'

Not sure if this is an issue or an alternate way to solve.

Fix or disable TestMetricsAccessible e2e

This test was added in #191 and regularly flakes, then doesn't. Sometimes it will fail for literally hundreds of CI runs before suddenly passing. Other times (like when we tried to fix it by extending the timeout in #271) it will instantly pass with no problem.

The failure is due to the retry loop timing out with the metrics not accessible. It regularly blocks PRs and real bug fixes from merging for days. Because of this, and the effort that's already been put into investigating it without any progress, I would like to disable this test until we can dedicate time to finding a real solution.

/kind flake
/kind failing-test

Add Worker Node Status Conditions

Currently, kube-scheduler operator surfaces status conditions for master nodes. These conditions simplify troubleshooting cluster issues. In many instances, the inability to schedule a workload is due to worker node related issues. For example, a default cluster requires 2 ingress controllers (aka routers) that must run on worker nodes.

cc: @Miciah @ironcladlou

Restrict access to metrics to localhost


  1. The kube-scheduler metrics are exposed via HTTP, so we should restrict this to localhost.
  2. The OCP Prometheus gathers kube-scheduler metrics through a ServiceMonitor via HTTPS.
  3. The kube-scheduler health check is on port 10259, and port 10251 is not used by the health check.

TraceAll loglevel should be 10

Many of the core scheduling decisions are logged at level V(10). Because of this, we should consider setting the TraceAll operand log level to 10 (instead of the recommended 8). Currently you need to manually edit the scheduler pod spec to get this value.

How can I enable a scheduler extender in a way that is supported by the scheduler operator?

I see the configuration in the configmap openshift-kube-scheduler/config:

config.yaml: |
{"apiVersion":"kubescheduler.config.k8s.io/v1beta3","clientConnection":{"kubeconfig":"/etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig"},"kind":"KubeSchedulerConfiguration","leaderElection":{"leaderElect":true,"leaseDuration":"137s","renewDeadline":"107s","resourceLock":"configmapsleases","resourceNamespace":"openshift-kube-scheduler","retryPeriod":"26s"}}

but attempts to overwrite it are immediately reverted.

I want to add in a scheduler extender.

I found a section of the OpenShift documentation which suggests that the scheduler operator's config observer can merge together sparse JSON snippets to form the actual configuration, but I'm unable to figure out the details.

Thanks ahead,
Mike

Support kube scheduler ComponentConfig

I know right now we support policy config through the operator, but I don't know what our status is on ComponentConfig. So I'm making this issue to track that and look into it.
