
Kubernetes Scheduler operator

The Kubernetes Scheduler operator manages and updates the Kubernetes scheduler deployed on top of OpenShift. The operator is based on the OpenShift library-go framework and is installed via the Cluster Version Operator (CVO).

It contains the following components:

  • Operator
  • Bootstrap manifest renderer
  • Installer based on static pods
  • Configuration observer

By default, the operator exposes Prometheus metrics via a metrics service. The metrics are collected from the following components:

  • Kubernetes Scheduler operator

Configuration

The configuration for the Kubernetes Scheduler is the result of merging:

  • a default config
  • an observed config (produced by the configuration observer mentioned above) taken from the spec of schedulers.config.openshift.io.

All of these are sparse configurations, i.e. unvalidated JSON snippets which are merged in order to form a valid configuration at the end.
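
For illustration, here is a hypothetical pair of sparse snippets and their merge (the leaderElection keys are borrowed from the rendered config shown later in this document):

default config (sparse):   {"leaderElection": {"leaderElect": true}}
observed config (sparse):  {"leaderElection": {"leaseDuration": "137s"}}
merged result:             {"leaderElection": {"leaderElect": true, "leaseDuration": "137s"}}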

Scheduling profiles

The following profiles are currently provided:

  • HighNodeUtilization
  • LowNodeUtilization
  • NoScoring

Each of these enables cluster-wide scheduling, configured via the Scheduler custom resource:

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: LowNodeUtilization
  ...
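
For example, to switch the profile without editing the full resource, a merge patch works (a sketch; any of the profile names above can be substituted):

$ oc patch scheduler cluster --type=merge -p '{"spec":{"profile":"HighNodeUtilization"}}'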

HighNodeUtilization

This profile disables the NodeResourcesBalancedAllocation plugin and the NodeResourcesFit plugin with the LeastAllocated type, and enables the NodeResourcesFit plugin with the MostAllocated type, favoring nodes that have a high allocation of resources. In the past the profile corresponded to disabling the NodeResourcesLeastAllocated and NodeResourcesBalancedAllocation plugins and enabling the NodeResourcesMostAllocated plugin.
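
For reference, the effect of this profile roughly corresponds to a KubeSchedulerConfiguration like the following (a hand-written sketch, not the operator's exact rendered output):

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated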

LowNodeUtilization

The default profile. It uses the default list of scheduling plugins as provided by the kube-scheduler.

NoScoring

This profile disables all scoring plugins.

Profile Customizations (TechnicalPreview)

Customizations of existing profiles are available under the .spec.profileCustomizations field:

Name                       Type    Description
dynamicResourceAllocation  string  Enable Dynamic Resource Allocation functionality

E.g.

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  policy:
    name: ""
  profile: HighNodeUtilization
  profileCustomizations:
    dynamicResourceAllocation: Enabled
  ...
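
To verify that a customization took effect, the rendered configuration can be inspected in the openshift-kube-scheduler/config configmap (a sketch using jsonpath):

$ oc get configmap config -n openshift-kube-scheduler -o jsonpath='{.data.config\.yaml}'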

Debugging

The operator also exposes events that can help with debugging issues. To get the operator events, run the following command:

$ oc get events -n openshift-kube-scheduler-operator

This operator is configured via the KubeScheduler custom resource:

$ oc describe kubescheduler
apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  managementState: Managed
  ...

The log level of individual kube-scheduler instances can be increased by setting the .spec.logLevel field:

$ oc explain kubescheduler.spec.logLevel
KIND:     KubeScheduler
VERSION:  operator.openshift.io/v1

FIELD:    logLevel <string>

DESCRIPTION:
     logLevel is an intent based logging for an overall component. It does not
     give fine grained control, but it is a simple way to manage coarse grained
     logging choices that operators have to interpret for their operands. Valid
     values are: "Normal", "Debug", "Trace", "TraceAll". Defaults to "Normal".

For example:

apiVersion: operator.openshift.io/v1
kind: KubeScheduler
metadata:
  name: cluster
spec:
  logLevel: Debug
  ...

Currently the log levels correspond to:

logLevel   log level
Normal     2
Debug      4
Trace      6
TraceAll   10
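
For example, to raise the level with a merge patch instead of editing the resource (a sketch):

$ oc patch kubescheduler cluster --type=merge -p '{"spec":{"logLevel":"Debug"}}'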

More about the individual configuration options can be learned by invoking oc explain:

$ oc explain kubescheduler

The current operator status is reported using the ClusterOperator resource. To get the current status, you can run the following command:

$ oc get clusteroperator/kube-scheduler
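
For example, to list just the reported conditions (a sketch using jsonpath):

$ oc get clusteroperator kube-scheduler -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'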

Developing and debugging the operator

In a running cluster, the cluster-version-operator is responsible for keeping the managed elements functioning and unaltered. To be able to use a custom operator image, one has to perform one of these operations:

  1. Set your operator to the unmanaged state (see the cluster-version-operator documentation for details), in short:
oc patch clusterversion/version --type='merge' -p "$(cat <<- EOF
spec:
  overrides:
  - group: apps
    kind: Deployment
    name: kube-scheduler-operator
    namespace: openshift-kube-scheduler-operator
    unmanaged: true
EOF
)"
  2. Scale down the cluster-version-operator:
oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version

IMPORTANT: This approach disables the cluster-version-operator completely, whereas the previous one only tells it not to manage the kube-scheduler-operator!

After doing this, you can change the image of the operator to the desired one:

oc patch pod/openshift-kube-scheduler-operator-<rand_digits> -n openshift-kube-scheduler-operator -p '{"spec":{"containers":[{"name":"kube-scheduler-operator-container","image":"<user>/cluster-kube-scheduler-operator"}]}}'
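
To confirm the override took effect, the image the operator pods are running can be checked (a sketch):

$ oc get pods -n openshift-kube-scheduler-operator -o jsonpath='{.items[*].spec.containers[*].image}'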

Developing and debugging the bootkube bootstrap phase

The operator image version used by the installer bootstrap phase can be overridden by creating a custom origin-release image pointing to the developer's operator :latest image:

$ IMAGE_ORG=<user> make images
$ docker push <user>/origin-cluster-kube-scheduler-operator

$ cd ../cluster-kube-apiserver-operator
$ IMAGES=cluster-kube-scheduler-operator IMAGE_ORG=<user> make origin-release
$ docker push <user>/origin-release:latest

$ cd ../installer
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=docker.io/<user>/origin-release:latest bin/openshift-install cluster ...

Profiling with pprof

Enable profiling

By default, kube-scheduler profiling is disabled. Profiling can be enabled manually by editing the config.yaml file on each master node.

Warning: this configuration is undone whenever a new revision is rolled out, and the steps need to be repeated.

Steps:

  1. access every master node (e.g. ssh or with oc debug)

    1. edit /etc/kubernetes/static-pod-resources/kube-scheduler-pod-$REV/configmaps/config/config.yaml (where $REV corresponds to the latest revision) and set the enableProfiling field to true.
    2. make a benign change to /etc/kubernetes/manifests/kube-scheduler-pod.yaml, e.g. updating "Waiting for port" to "Waiting for port " (adding one trailing blank space to the string). Wait for the updated pod manifest to be picked up and a new kube-scheduler instance to be running and ready.
  2. oc port-forward pod/$KUBE_SCHEDULER_POD_NAME 10259:10259 in a separate terminal/window (where $KUBE_SCHEDULER_POD_NAME corresponds to a running kube-scheduler pod instance)

  3. apply the following manifests to allow anonymous access:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: kubescheduler-anonymous-access
    rules:
    - nonResourceURLs: ["/debug", "/debug/*"]
      verbs:
      - get
      - list
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: kubescheduler-anonymous-access
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kubescheduler-anonymous-access
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: User
      name: system:anonymous
  4. access https://localhost:10259/debug/pprof/
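
With the port-forward and the anonymous-access binding above in place, the endpoints can also be fetched from the command line, e.g. a 30-second CPU profile (a sketch; these are the standard net/http/pprof endpoints):

$ curl -k "https://localhost:10259/debug/pprof/profile?seconds=30" -o cpu.prof
$ go tool pprof cpu.prof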

Heap profiling

The tool requires pulling the heap file and the kube-scheduler binary.

Steps:

  1. Pull the heap data by accessing https://localhost:10259/debug/pprof/heap
  2. Extract the kube-scheduler binary from the corresponding image (by checking the kube-scheduler pod manifest):
    $ podman pull --authfile $AUTHFILE $KUBE_SCHEDULER_IMAGE
    $ podman cp $(podman create --name kube-scheduler $KUBE_SCHEDULER_IMAGE):/usr/bin/kube-scheduler ./kube-scheduler
    • $AUTHFILE corresponds to your authentication file if not already located in the known paths
    • $KUBE_SCHEDULER_IMAGE corresponds to the kube-scheduler image found in a kube-scheduler pod manifest
  3. Run go tool pprof kube-scheduler heap
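
Putting the steps together (a sketch, reusing the port-forward and anonymous access from the previous section):

$ curl -k https://localhost:10259/debug/pprof/heap -o heap
$ go tool pprof kube-scheduler heap
(pprof) top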

Dumping kube-scheduler's node cache

From https://github.com/kubernetes/kubernetes/blob/be77b0b82b01a3fc810118f095594ec8bdd3c3aa/pkg/scheduler/internal/cache/debugger/debugger.go#L58:

// CacheDebugger provides ways to check and write cache information for debugging.
// ListenForSignal starts a goroutine that will trigger the CacheDebugger's
// behavior when the process receives SIGINT (Windows) or SIGUSER2 (non-Windows).

When a kube-scheduler process receives SIGUSR2, the node cache gets dumped into the logs. E.g.:

I0105 03:32:31.936642       1 dumper.go:52] "Dump of cached NodeInfo" nodes=<
    Node name: NODENAME1
    Deleted: false
    Requested Resources: ...
    Scheduled Pods(number: 41):
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 23c63c58-cc36-48be-97d9-f4f6088a709d, phase: Running, nominated node:
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 04b3b3b4-52a3-46d0-b7ff-aa748eecd404, phase: Running, nominated node:
    ...


    Node name: NODENAME2
    Deleted: false
    Requested Resources: ...
    Scheduled Pods(number: 53):
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 7cbce63f-3fb9-404a-a69b-6728592e6b2, phase: Running, nominated node:
    name: POD_NAME, namespace: POD_NAMESPACE, uid: 50bc7d7e-bd30-4c47-82ce-a9d3eb737434, phase: Running, nominated node:
    ...
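
To trigger the dump, the signal can be sent from the node hosting the scheduler (a sketch; assumes pidof is available in the host namespace):

$ oc debug node/<master-node>
# chroot /host
# kill -USR2 $(pidof kube-scheduler)

Afterwards, inspect the kube-scheduler container logs for the "Dump of cached NodeInfo" lines.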


Issues

Error setting custom scheduler image with IMAGE env var

Trying to deploy a custom scheduler through the operator using the provided ${IMAGE} env var may fail due to ${IMAGE} being parsed for both the init container (wait-for-host-port) and the operand kube-scheduler.

In starter.go, the IMAGE env var is passed to NewTargetConfigController which uses it to replace the scheduler image in the pod yaml. However, the pod yaml uses ${IMAGE} in both the init container and the scheduler container.

What this means is that if I build my own custom scheduler image and specify it in the env var, the scheduler pod will fail to launch if my custom scheduler is built in a way that doesn't provide /usr/bin/timeout. Providing additional commands should not be a requirement for the operand (i.e., all that should be required is the kube-scheduler command, although the operator additionally assumes that hyperkube will also be provided; given that the scheduler is not built through hyperkube upstream anymore, this becomes problematic for users, so the command should also be customizable).

$ oc get all -n openshift-kube-scheduler
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-0   0/2     Init:CreateContainerError   0          32s
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-1   2/2     Running                     0          179m
pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-2   2/2     Running                     1          3h

$ oc describe pod/openshift-kube-scheduler-ci-ln-lx8b05k-f76d1-szgz7-master-0 -n openshift-kube-scheduler
...
Events:
  Type     Reason   Age   From                                         Message
  ----     ------   ----  ----                                         -------
  Normal   Pulling  40s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Pulling image "docker.io/mdame/kube-scheduler-119"
  Normal   Pulled   34s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Successfully pulled image "docker.io/mdame/kube-scheduler-119"
  Warning  Failed   34s   kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:00Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Warning  Failed  33s  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:01Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Warning  Failed  22s  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:12Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"
  Normal   Pulled  9s (x3 over 33s)  kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Container image "docker.io/mdame/kube-scheduler-119" already present on machine
  Warning  Failed  9s                kubelet, ci-ln-lx8b05k-f76d1-szgz7-master-0  Error: container create failed: time="2020-07-20T18:03:25Z" level=error msg="container_linux.go:348: starting container process caused \"exec: \\\"/usr/bin/timeout\\\": stat /usr/bin/timeout: no such file or directory\""
container_linux.go:348: starting container process caused "exec: \"/usr/bin/timeout\": stat /usr/bin/timeout: no such file or directory"

As a user, I would like to be able to change the IMAGE and command for only the kube-scheduler container in order to run a custom scheduler image through the operator. However, this comes with the risk that users may run a broken scheduler and break their clusters. Given that the IMAGE env var isn't directly provided in the operator config, this seems sufficiently "hidden" as an advanced administrator/developer feature. It's also already provided, in all of our operators, but it's broken in this one.

With our operator only supporting deprecated Policy config, there is no other way for any user to customize the default scheduler.

/kind bug

Scheduler config file apiVersion is deprecated

Noticing the following issue:

WARNING: the provided config file is an unsupported apiVersion ("componentconfig/v1alpha1"), which will be removed in future releases

WARNING: switch to command-line flags or update your config file apiVersion to "kubescheduler.config.k8s.io/v1alpha1"

WARNING: apiVersions at alpha-level are not guaranteed to be supported in future releases

Discrepancy in operator management state between kubectl patch vs targetconfigcontroller

When I wanted to increase the log level of kube-scheduler, I noticed this.

In targetconfigcontroller, I see it accepts Managed, Unmanaged and Removed:

switch operatorSpec.ManagementState {
case operatorv1.Managed:
case operatorv1.Unmanaged:
	return nil
case operatorv1.Removed:
	// TODO probably just fail
	return nil
default:
	c.eventRecorder.Warningf("ManagementStateUnknown", "Unrecognized operator management state %q", operatorSpec.ManagementState)
	return nil
}

While kubectl edit accepts only Force and Managed.

kubeschedulers.operator.openshift.io "cluster" was not valid:

* spec.managementState: Invalid value: "": spec.managementState in body should match '^(Managed|Force)$'

Not sure if this is an issue or an alternate way to solve.

Fix or disable TestMetricsAccessible e2e

This test was added in #191 and regularly flakes, then doesn't. Sometimes it will fail for literally hundreds of CI runs before suddenly passing. Other times (like when we tried to fix it by extending the timeout in #271) it will instantly pass with no problem.

The failure is due to the retry loop timing out with the metrics not accessible. It regularly blocks PRs and real bug fixes from merging for days. Because of this, and the effort that's already been put into investigating it without any progress, I would like to disable this test until we can dedicate time to finding a real solution.

/kind flake
/kind failing-test

Add Worker Node Status Conditions

Currently, kube-scheduler operator surfaces status conditions for master nodes. These conditions simplify troubleshooting cluster issues. In many instances, the inability to schedule a workload is due to worker node related issues. For example, a default cluster requires 2 ingress controllers (aka routers) that must run on worker nodes.

cc: @Miciah @ironcladlou

Restrict access to metrics to localhost


  1. The kube-scheduler metrics are exposed via HTTP, so we should restrict this to localhost.
  2. The OCP Prometheus gathers kube-scheduler metrics through a ServiceMonitor via HTTPS.
  3. The kube-scheduler health check is on port 10259, and port 10251 is not used by the health check.

TraceAll loglevel should be 10

Many of the core scheduling decisions are logged at level V(10). Because of this, we should consider setting the TraceAll operand log level to 10 (instead of the recommended 8). Currently you need to manually edit the scheduler pod spec to get this value.

How can I enable a scheduler extender in a way that is supported by the scheduler operator?

I see the configuration in the configmap openshift-kube-scheduler/config:

config.yaml: |
{"apiVersion":"kubescheduler.config.k8s.io/v1beta3","clientConnection":{"kubeconfig":"/etc/kubernetes/static-pod-resources/configmaps/scheduler-kubeconfig/kubeconfig"},"kind":"KubeSchedulerConfiguration","leaderElection":{"leaderElect":true,"leaseDuration":"137s","renewDeadline":"107s","resourceLock":"configmapsleases","resourceNamespace":"openshift-kube-scheduler","retryPeriod":"26s"}}

but attempts to overwrite it are immediately reverted.

I want to add in a scheduler extender.

I found a section of the OpenShift documentation which suggests that the scheduler operator's config observer can merge together sparse JSON snippets to form the actual configuration, but I'm unable to figure out the details.

Thanks ahead,
Mike

Support kube scheduler ComponentConfig

I know right now we support policy config through the operator, but I don't know what our status is on ComponentConfig. So I'm making this issue to track that and look into it.
