
gardener-attic / gardener-resource-manager


Kubernetes resource reconciliation controller capable of performing health checks for managed resources.

Home Page: https://gardener.cloud

addons controller

gardener-resource-manager's People

Contributors

andreasburger, deitch, gardener-robot, gardener-robot-ci-1, gardener-robot-ci-2, gardener-robot-ci-3, harishmanasa, ialidzhikov, jia-jerry, mandelsoft, raphaelvogel, rfranzke, timebertt, timuthy, vpnachev, zanetworker

gardener-resource-manager's Issues

Service ClusterIP is ignored

How to categorize this issue?

/area quality
/area usability
/kind bug
/priority normal

What happened:
grm applied a Service with .spec.clusterIP set (kube-dns with x.x.x.10), but it was discarded while merging and the Service got assigned a different cluster IP. That way, the guestbook app failed to look up DNS entries on x.x.x.10.

What you expected to happen:
grm to use the provided clusterIP.
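
A minimal sketch of the merge semantics the reporter expects (hypothetical helper on typed corev1 objects, not the actual merger code): keep a clusterIP that the desired manifest sets explicitly, and only fall back to the live value when the desired one is empty.

package mergerutil

import corev1 "k8s.io/api/core/v1"

// mergeClusterIP keeps the clusterIP from the desired manifest if it is set;
// only an empty desired value takes over whatever IP the live Service was
// already assigned by the API server.
func mergeClusterIP(desired, live *corev1.Service) {
	if desired.Spec.ClusterIP == "" {
		desired.Spec.ClusterIP = live.Spec.ClusterIP
	}
}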

How to reproduce it (as minimally and precisely as possible):

  • Apply an MR with a Service that has spec.clusterIP set
  • Observe that the Service gets assigned a different IP

Anything else we need to know?:

Environment:

  • Gardener-Resource-Manager version: v0.13.0
  • Kubernetes version (use kubectl version): v1.18.2
  • Cloud provider or hardware configuration: all
  • Others:

Garbage Collector for immutable `ConfigMap`s/`Secret`s which are no longer in-use

How to categorize this issue?

/area robustness quality
/kind enhancement
/priority normal

What would you like to be added:
GRM could add a hash suffix to the names of ConfigMaps or Secrets that are mounted by Deployments/DaemonSets/etc. (it would also have to adapt the names in the .volumes[] section). The hash could be a checksum of the ConfigMap's/Secret's .data map.

This should be done by default; however, it should be possible to

  • disable this entire behaviour via a CLI parameter
  • opt-out of this behaviour for particular resources via an annotation
  • opt-out of this behaviour for all resources of a ManagedResource via a field in the .spec

Additionally, the ConfigMaps/Secrets should be marked immutable.
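
A rough sketch of the naming and immutability part, assuming a SHA-256 checksum over .data and a Kubernetes version that supports the immutable field; the suffix length and hash algorithm are open implementation details.

package configmaputil

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// withHashSuffix returns a copy of the ConfigMap whose name carries a checksum
// of its .data and which is marked immutable, so changed configuration always
// results in a new object (and hence a new rollout once the .volumes[]/env
// references are adapted accordingly).
func withHashSuffix(cm *corev1.ConfigMap) *corev1.ConfigMap {
	out := cm.DeepCopy()

	// Hash the data in a deterministic key order.
	keys := make([]string, 0, len(out.Data))
	for k := range out.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(out.Data[k]))
	}

	out.Name = fmt.Sprintf("%s-%s", out.Name, hex.EncodeToString(h.Sum(nil))[:8])
	immutable := true
	out.Immutable = &immutable
	return out
}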

Why is this needed:
GRM creates the Kubernetes resources managed by a ManagedResource in parallel. This can lead to issues when, e.g., a Deployment is mounting a ConfigMap/Secret and the Deployment is updated before the ConfigMap/Secret, and/or the update of the ConfigMap/Secret fails for whatever reason. In rare cases it can happen that a new pod is already rolled out and started but still mounting the old ConfigMap/Secret data. Eventually, GRM will also succeed in updating the ConfigMap/Secret; however, the pod is already running and has started with the old configuration data. This can lead to issues when the new pod is incompatible with the old configuration data.

Another problematic case is when the kubelet's ConfigMap/Secret cache is stale. We have observed this very rarely in the past: the kubelet starts a pod but mounts the old/outdated configuration data. Eventually, the kubelet's cache will be up-to-date again, but the pod might have already been started and read the old config.

With the above implementation (hash suffix), neither problematic scenario can happen anymore because a new pod would only start with the new ConfigMap/Secret.

Implementation considerations:
We would need a mechanism to clean up the old, no-longer-used ConfigMaps/Secrets, perhaps with a separate controller.

/cc @timebertt @amshuman-kr @shreyas-s-rao @mvladev

Resources created with default namespace aren't deleted

Let's look at the following example:

apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example2
  namespace: default
type: Opaque
data:
  other-objects.yaml: YXBpVmVyc2lvbjogYXBwcy92MSAjIGZvciB2ZXJzaW9ucyBiZWZvcmUgMS45LjAgdXNlIGFwcHMvdjFiZXRhMgpraW5kOiBEZXBsb3ltZW50Cm1ldGFkYXRhOgogIG5hbWU6IG5naW54LWRlcGxveW1lbnQKICBuYW1lc3BhY2U6IGRlZmF1bHQKc3BlYzoKICBzZWxlY3RvcjoKICAgIG1hdGNoTGFiZWxzOgogICAgICBhcHA6IG5naW54CiAgcmVwbGljYXM6IDIgIyB0ZWxscyBkZXBsb3ltZW50IHRvIHJ1biAyIHBvZHMgbWF0Y2hpbmcgdGhlIHRlbXBsYXRlCiAgdGVtcGxhdGU6CiAgICBtZXRhZGF0YToKICAgICAgbGFiZWxzOgogICAgICAgIGFwcDogbmdpbngKICAgIHNwZWM6CiAgICAgIGNvbnRhaW5lcnM6CiAgICAgIC0gbmFtZTogbmdpbngKICAgICAgICBpbWFnZTogbmdpbng6MS43LjkKICAgICAgICBwb3J0czoKICAgICAgICAtIGNvbnRhaW5lclBvcnQ6IDgwCg==
    # apiVersion: apps/v1
    # kind: Deployment
    # metadata:
    #   name: nginx-deployment
    # spec:
    #   selector:
    #     matchLabels:
    #       app: nginx
    #   replicas: 2 # tells deployment to run 2 pods matching the template
    #   template:
    #     metadata:
    #       labels:
    #         app: nginx
    #     spec:
    #       containers:
    #       - name: nginx
    #         image: nginx:1.7.9
    #         ports:
    #         - containerPort: 80

The Deployment is a namespaced resource but has no namespace specified in the metadata section. Consequently, the Gardener-Resource-Manager creates the resource in the default namespace, which is either specified in the target kubeconfig or determined from the in-cluster config via the mounted ServiceAccount (/var/run/secrets/kubernetes.io/serviceaccount/namespace).

Furthermore, the Deployment is written to the status of the ManagedResource, but without any namespace information. After that, the Gardener-Resource-Manager is no longer able to delete the resource.

To solve the issue we have to address two challenges:
1.) Dynamically find out what namespace is used when applying the resource or set it via a command line flag, e.g. --default-namespace.
2.) If the resource has no namespace specified, find out whether it is a namespaced resource at all and, if applicable, set the namespace from 1.). For this, we'd need to use a DiscoveryClient (see the sketch below).
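
A rough sketch of 2.), assuming a meta.RESTMapper backed by discovery information is available and the default namespace from 1.) is passed in:

package applier

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// defaultNamespaceIfNeeded sets the given default namespace on the object, but
// only if the object's kind is namespace-scoped (looked up via the RESTMapper)
// and no namespace has been specified yet.
func defaultNamespaceIfNeeded(mapper meta.RESTMapper, obj *unstructured.Unstructured, defaultNamespace string) error {
	gvk := obj.GroupVersionKind()
	mapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
	if err != nil {
		return err
	}
	if mapping.Scope.Name() == meta.RESTScopeNameNamespace && obj.GetNamespace() == "" {
		obj.SetNamespace(defaultNamespace)
	}
	return nil
}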

ManagedResource reconciliation takes more than 20m when an APIService is unavailable

How to categorize this issue?

/area ops-productivity
/kind bug

What happened:

I see a Shoot cluster whose SystemComponentsHealthy condition is flapping quite often between healthy and unhealthy:

    - type: SystemComponentsHealthy
      status: 'False'
      lastTransitionTime: '2021-02-15T09:55:34Z'
      lastUpdateTime: '2021-02-15T09:53:33Z'
      reason: ApplyFailed
      message: >-
        Could not apply all new resources: 1 error occurred: Operation cannot be
        fulfilled on horizontalpodautoscalers.autoscaling "coredns": the object
        has been modified; please apply your changes to the latest version and
        try again

When I check the logs of the gardener-resource-manager I see that shoot-core ManagedResource is reconciled for more than 20m:

$ cat ~/gardener-resource-manager-dccf959d5-kcpc4.log | grep '"object":"shoot--foo--bar/shoot-core"' | grep -v 'controller-runtime.manager.health-controller'

{"level":"info","ts":"2021-02-15T10:38:59.091Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:38:59.091Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T10:52:16.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:02:59.166Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:02:59.166Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:24:23.950Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:24:23.950Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.943Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.944Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T11:43:51.944Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.421Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.422Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:01:52.422Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:25:14.030Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:25:14.030Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T12:51:01.543Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:11:27.942Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:31:54.363Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:31:54.363Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.759Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.760Z","logger":"controller-runtime.manager.resource-controller","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T13:54:46.760Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}

What you expected to happen:

Reconciliation of the shoot-core ManagedResource to take several minutes at most.

How to reproduce it (as minimally and precisely as possible):
Not clear for now.

In the logs of gardener-resource-manager I see logs about throttling requests:

$ cat ~/gardener-resource-manager-dccf959d5-kcpc4.log | grep 'Throttling request'

# full output omitted

I0215 13:57:12.080636       1 request.go:621] Throttling request took 13.597998551s, request: GET:https://kube-apiserver/apis/crd.projectcalico.org/v1?timeout=32s
I0215 13:57:22.280577       1 request.go:621] Throttling request took 9.197729656s, request: GET:https://kube-apiserver/apis/servicecatalog.kyma-project.io/v1alpha1?timeout=32s
I0215 13:57:32.280683       1 request.go:621] Throttling request took 4.598267175s, request: GET:https://kube-apiserver/apis/settings.svcat.k8s.io/v1alpha1?timeout=32s
I0215 13:57:43.480787       1 request.go:621] Throttling request took 1.195492007s, request: GET:https://kube-apiserver/apis/apps/v1?timeout=32s
I0215 13:57:53.680666       1 request.go:621] Throttling request took 11.394955641s, request: GET:https://kube-apiserver/apis/serverless.kyma-project.io/v1alpha1?timeout=32s
I0215 13:58:03.880641       1 request.go:621] Throttling request took 6.99571505s, request: GET:https://kube-apiserver/apis/settings.svcat.k8s.io/v1alpha1?timeout=32s
I0215 13:58:14.080556       1 request.go:621] Throttling request took 2.597793766s, request: GET:https://kube-apiserver/apis/extensions/v1beta1?timeout=32s
I0215 13:58:24.080630       1 request.go:621] Throttling request took 12.597752422s, request: GET:https://kube-apiserver/apis/batch/v1?timeout=32s
I0215 13:58:34.280663       1 request.go:621] Throttling request took 8.107359855s, request: GET:https://kube-apiserver/apis/autoscaling/v1?timeout=32s
I0215 13:58:44.480527       1 request.go:621] Throttling request took 3.798117917s, request: GET:https://kube-apiserver/apis/install.istio.io/v1alpha1?timeout=32s
I0215 13:58:54.680493       1 request.go:621] Throttling request took 13.997952271s, request: GET:https://kube-apiserver/apis/ui.kyma-project.io/v1alpha1?timeout=32s
I0215 13:59:04.680516       1 request.go:621] Throttling request took 9.397612672s, request: GET:https://kube-apiserver/apis/extensions/v1beta1?timeout=32s
I0215 13:59:14.880495       1 request.go:621] Throttling request took 4.998066968s, request: GET:https://kube-apiserver/apis/flows.knative.dev/v1alpha1?timeout=32

Not sure, but from the logs it seems that gardener-resource-manager is making quite a lot of discovery calls.

Anything else we need to know?:

Environment:

  • Gardener version: v1.16.2
  • gardener-resource-manager version: v0.21.0
  • Kubernetes version (use kubectl version): v1.18.12
  • Cloud provider or hardware configuration: Azure
  • Others:

spec.loadBalancerIP field of Service resource with type: LoadBalancer gets discarded

How to categorize this issue?

/area networking
/kind bug
/priority normal

What happened:

In https://github.com/gardener/gardener-extension-provider-packet, the CCM sets the spec.loadBalancerIP field for Services of type: LoadBalancer to the Elastic IP assigned to each Service. This IP is then picked up by MetalLB and announced for the Service.

However, for Services managed by Gardener-Resource-Manager like kube-system/vpn-shoot, spec.loadBalancerIP gets emptied every minute, when Gardener-Resource-Manager runs its reconciliation loop. This breaks the Shoot VPN, making the Shoot cluster state unstable.

What you expected to happen:

When the spec.loadBalancerIP field is set on the target resource and not specified on the source resource, Gardener-Resource-Manager should keep the target's value instead of removing it from the target resource.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Service object managed by the resource manager without spec.loadBalancerIP set.
  2. Set spec.loadBalancerIP using API on target cluster.
  3. Wait up to a minute for Gardener-Resource-Manager to remove the value of this field.

Anything else we need to know?:

Perhaps https://github.com/gardener/gardener-resource-manager/blob/master/pkg/controller/managedresource/merger.go#L227 could be changed to preserve this field.
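
A hedged sketch of the proposed behaviour (not the current merger code): take over spec.loadBalancerIP from the Service living in the target cluster whenever the source manifest leaves it empty.

package mergerutil

import corev1 "k8s.io/api/core/v1"

// keepLoadBalancerIP preserves a loadBalancerIP that was set on the live
// Service (e.g. by the packet CCM) when the desired manifest does not specify
// one, instead of wiping it on every reconciliation.
func keepLoadBalancerIP(desired, live *corev1.Service) {
	if desired.Spec.LoadBalancerIP == "" {
		desired.Spec.LoadBalancerIP = live.Spec.LoadBalancerIP
	}
}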

CC @timebertt @deitch

Environment:

  • Gardener-Resource-Manager version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration: Packet/EM
  • Others:

Add annotations for source of generated resources

How to categorize this issue?
/area control-plane
/kind enhancement
/priority normal

What would you like to be added:
When resources are generated, would it be possible to add annotations to the resources to document which ManagedResource and Secret the resource was defined in? This would greatly enhance usability, especially in a downtime situation, to figure out why resources keep getting replaced.

Why is this needed:
When the resource manager creates a resource, it's unclear where it is generated from. Sometimes the resource is wrong and needs to be live patched, but it keeps getting overwritten. Finding the source is difficult because the secrets are base64 encoded.
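
A small sketch of the idea; the annotation keys below are made up for illustration and are not part of any existing API.

package applier

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// annotateOrigin records origin annotations on an object before it is applied,
// so operators can trace it back to the ManagedResource and the Secret that
// defined it. The keys are hypothetical.
func annotateOrigin(obj *unstructured.Unstructured, managedResource, secret string) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["resources.gardener.cloud/origin-managedresource"] = managedResource // hypothetical key
	annotations["resources.gardener.cloud/origin-secret"] = secret                   // hypothetical key
	obj.SetAnnotations(annotations)
}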

`foregroundDeletion` finalizer blocks deletion

How to categorize this issue?

/area usability
/kind bug
/priority normal

What happened:
The GRM is blocked because it waits for a ClusterRole or ClusterRoleBinding to be deleted. However, both have the foregroundDeletion finalizer and cannot be deleted.

{"level":"info","ts":"2020-06-22T14:20:45.041Z","logger":"gardener-resource-manager.reconciler","msg":"Deletion is still pending","object":"shoot--dev--foo/shoot-core","err":"Could not clean all old resources: 9 errors occurred: [deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:vpa-admission-controller\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-evictionter-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:vpa-checkpoint-actor\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-admission-controller\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRole/default/system:evictioner\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-checkpoint-actor\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-exporter-role-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:vpa-target-reader-binding\" is still pending, deletion of old resource \"rbac.authorization.k8s.io/v1/ClusterRoleBinding/default/system:metrics-reader\" is still pending]"}

Logs of KCM:

E0622 14:19:56.996355       1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"rbac.authorization.k8s.io/v1", Kind:"ClusterRole", Name:"system:vpa-admission-controller", UID:"bee40e0d-5816-4fc2-84c3-024ae6312837", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: clusterroles.rbac.authorization.k8s.io "system:vpa-admission-controller" is forbidden: user "system:serviceaccount:kube-system:generic-garbage-collector" (groups=["system:serviceaccounts" "system:serviceaccounts:kube-system" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:

What you expected to happen:
GRM is not blocked.

How to reproduce it (as minimally and precisely as possible):
Let GRM manage a ClusterRole/ClusterRoleBinding and then delete the ManagedResource.

Environment:

  • Gardener-Resource-Manager version: master
  • Kubernetes version (use kubectl version): v1.18.0

Worker routines get stuck

How to categorize this issue?

/area robustness
/kind bug
/priority normal

What happened:

We have observed some situations where grm gets stuck reconciling a specific managed resource and does not act upon it anymore.
In all cases I observed, it happened in conjunction with either a longer period of downtime of the source or target API server (before #95) or a large amount of secret data in the target cluster (as described in #92).

What you expected to happen:

grm should not get stuck and reconcile all managed resources with the given sync interval.

How to reproduce it (as minimally and precisely as possible):

Not sure yet.
My guess would be that the worker goroutines get stuck in some WaitForCacheSync when the API server is unavailable for a longer period of time or the amount of watched data is too big.

Anything else we need to know?:

Environment:

  • Gardener-Resource-Manager version: v0.20.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

Services deletion blocked

How to categorize this issue?

/area usability
/kind bug
/priority blocker

What happened:
Deletion of ManagedResources with Services can be blocked because of kubernetes/kubernetes#91287.
TL;DR: the EndpointSlice controller tries to create new EndpointSlices although the respective Service is already marked for deletion. That way, the Service deletion is blocked.

Similar to #60

What you expected to happen:
GRM to be able to delete the Service.

How to reproduce it (as minimally and precisely as possible):
Create a ManagedResource with a Service, let it be created and delete it again.
It does not always happen, but with a high probability.
Sometimes the deletion succeeds after a few minutes, when the garbage collection controller runs at exactly the right moment.

Anything else we need to know?:

Environment:

  • Gardener-Resource-Manager version: 6993f8f
  • Kubernetes version (use kubectl version): v1.18.3
  • Cloud provider or hardware configuration: all
  • Others:

Finalizer is not removed from secret

What happened:
If you remove a secretRef from an MR, grm will not remove its finalizer from the referenced Secret.
Therefore, the Secret cannot be deleted without manual intervention.

With gardener/gardener#2210 (which creates dedicated secrets for each worker pool and references each of them in the respective MR), this will lead to a problem once g/grm is released: the Shoot's namespace in the Seed cannot be deleted anymore if the end user has removed a worker pool from their Shoot, because the namespace will contain Secrets with stale finalizers.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

  1. Create a Shoot with 2 worker pools
  2. Remove one worker pool
  3. Delete the Shoot
  4. Observe that the Shoot's namespace in the Seed can't be deleted (pending deletion of secret for the second worker pool's cloud config)

Anything else we need to know:

Environment:

Switch leader election to lease

How to categorize this issue?

/area scalability
/kind enhancement
/priority normal

What would you like to be added:

Once kubernetes-sigs/controller-runtime#1147 has been released and vendored, we should go back to the default leader election settings and switch to the Lease object for leader election.

Why is this needed:

We have seen significant scalability problems when running more than 100 Shoot grm instances in one Seed, each doing leader election on ConfigMaps every 2s.
We were able to work around this issue with #72 and gardener/gardener#2668, but we should switch to Leases as a long-term solution.
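
A sketch of what the switch could look like with a controller-runtime version that contains kubernetes-sigs/controller-runtime#1147 (field and constant names as of that version):

package app

import (
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager goes back to the default leader election timings and switches the
// underlying lock object from a ConfigMap to a coordination.k8s.io Lease.
func newManager() (manager.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:             true,
		LeaderElectionID:           "gardener-resource-manager",
		LeaderElectionResourceLock: resourcelock.LeasesResourceLock,
	})
}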

Service with externalTrafficPolicy=Local can't be updated

When the externalTrafficPolicy field of a Service is set to Local, the Service can't be updated by the resource manager.

The error message is like this:

{
    "level": "error",
    "ts": "2020-01-18T13:59:03.230Z",
    "logger": "gardener-resource-manager.reconciler",
    "msg": "Could not apply all new resources",
    "object": "shoot--core--v15/addons",
    "error": "Errors occurred during applying: [error during apply of object \"v1/Service/kube-system/addons-nginx-ingress-controller\": Service \"addons-nginx-ingress-controller\" is invalid: spec.healthCheckNodePort: Invalid value: 0: cannot change healthCheckNodePort on loadBalancer service with externalTraffic=Local during update]",
    "stacktrace": "
    github.com/go-logr/zapr.(*zapLogger).Error
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128
    github.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).reconcile
        /root/gopath/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:210
    github.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).Reconcile
        /root/gopath/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:97
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
    sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
    k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
    k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
    k8s.io/apimachinery/pkg/util/wait.Until
        /root/gopath/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}
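
The validation error comes from the merged object carrying healthCheckNodePort: 0 while the live Service already has a port allocated. A hedged sketch of one way to keep the allocated port (not necessarily how the resource manager solved it):

package mergerutil

import corev1 "k8s.io/api/core/v1"

// keepHealthCheckNodePort takes over the already-allocated healthCheckNodePort
// from the live Service when the desired manifest does not set one, because the
// API server rejects changing (or zeroing) it on a LoadBalancer Service with
// externalTrafficPolicy=Local.
func keepHealthCheckNodePort(desired, live *corev1.Service) {
	if desired.Spec.HealthCheckNodePort == 0 {
		desired.Spec.HealthCheckNodePort = live.Spec.HealthCheckNodePort
	}
}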

`status` of `VerticalPodAutoscaler` objects is not kept

How to categorize this issue?

/area usability
/kind bug
/priority normal

What happened:
The status is not kept when the GRM reconciles a VerticalPodAutoscaler object because the VerticalPodAutoscaler CRD does not support subresources.
Most likely, all other CRDs without support for the status subresource are affected as well.

What you expected to happen:
The GRM should not overwrite the status.
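
A minimal sketch of one possible fix, assuming the merge works on unstructured objects: carry over the live object's .status into the desired object before updating, so the update cannot wipe it for CRDs without a status subresource.

package mergerutil

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// keepStatus copies .status from the live object into the desired object. For
// resources whose CRD does not enable the status subresource, a regular update
// would otherwise overwrite the status with an empty value.
func keepStatus(desired, live *unstructured.Unstructured) error {
	status, found, err := unstructured.NestedMap(live.Object, "status")
	if err != nil || !found {
		return err
	}
	return unstructured.SetNestedMap(desired.Object, status, "status")
}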

How to reproduce it (as minimally and precisely as possible):

  1. Create a VerticalPodAutoscaler
  2. See status being overwritten.

Anything else we need to know?:

Environment:

  • Gardener-Resource-Manager version: v0.12.0
  • Cloud provider or hardware configuration: all

Improve CheckDaemonSet func

How to categorize this issue?

/area ops-productivity
/kind bug
/priority normal

What happened:
Currently, the CheckDaemonSet func does not handle all cases in which a DaemonSet is ready/not ready. See the following example:

$ k -n kube-system get po csi-disk-plugin-alicloud-xbwhj
NAME                             READY   STATUS             RESTARTS   AGE
csi-disk-plugin-alicloud-xbwhj   1/2     ImagePullBackOff   0          19m

$ k -n kube-system get ds csi-disk-plugin-alicloud
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
csi-disk-plugin-alicloud   1         1         0       1            0           <none>          30m


$ k -n kube-system get ds csi-disk-plugin-alicloud -o yaml

status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 1
  observedGeneration: 4
  updatedNumberScheduled: 1

On the other hand, the corresponding MR is reported as healthy:

$ k -n shoot--foo--bar get mr extension-controlplane-shoot
NAME                           CLASS   APPLIED   HEALTHY   AGE
extension-controlplane-shoot           True      True      8h

What you expected to happen:
Similar issues should be caught by the CheckDaemonSet func.
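
A sketch of a stricter readiness check, just to illustrate the status fields involved (not the actual CheckDaemonSet implementation):

package health

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// checkDaemonSet fails not only on outdated generations but also when pods are
// scheduled but unavailable, which covers cases like the ImagePullBackOff above.
func checkDaemonSet(ds *appsv1.DaemonSet) error {
	if ds.Status.ObservedGeneration < ds.Generation {
		return fmt.Errorf("observed generation %d is outdated (current generation %d)", ds.Status.ObservedGeneration, ds.Generation)
	}
	if ds.Status.NumberUnavailable > 0 {
		return fmt.Errorf("%d unavailable pods", ds.Status.NumberUnavailable)
	}
	if ds.Status.NumberReady < ds.Status.DesiredNumberScheduled {
		return fmt.Errorf("only %d of %d pods are ready", ds.Status.NumberReady, ds.Status.DesiredNumberScheduled)
	}
	return nil
}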

How to reproduce it (as minimally and precisely as possible):

  1. Create an MR that manages a DaemonSet. Set the image of the DaemonSet to a non-existing one.

  2. Observe that the MR is nevertheless reported as healthy.

Environment:

  • Gardener-Resource-Manager version: v0.17.0

/healthz endpoint for liveness probe

How to categorize this issue?

/area ops-productivity
/kind enhancement
/priority normal

What would you like to be added:
We should introduce a /healthz endpoint for a liveness probe that can be used in the Deployment manifest for GRM so that it is automatically restarted when it does not function properly.
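
With controller-runtime this could boil down to registering a ping checker on the manager and pointing the Deployment's livenessProbe at the probe address; a sketch, where the exact option and field names depend on the controller-runtime version in use.

package app

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManagerWithHealthz serves /healthz on :8081; the Deployment's
// livenessProbe would point at this endpoint.
func newManagerWithHealthz() (manager.Manager, error) {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		return nil, err
	}
	// Simple ping checker; a real check could additionally verify that the
	// caches are synced and the workers are alive.
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		return nil, err
	}
	return mgr, nil
}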

Why is this needed:
I saw a running GRM pod that wasn't functioning, but it also didn't get restarted automatically (obviously). These were the logs:

E1023 06:02:21.112646       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Role: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.113814       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.NetworkPolicy: Get https://kube-apiserver/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.114967       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.PodSecurityPolicy: Get https://kube-apiserver/apis/policy/v1beta1/podsecuritypolicies?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:21.116021       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v2beta1.HorizontalPodAutoscaler: Get https://kube-apiserver/apis/autoscaling/v2beta1/horizontalpodautoscalers?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-worker-mcm-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/shoot-cloud-config-execution"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-worker-mcm-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.208Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/shoot-cloud-config-execution"}
{"level":"info","ts":"2020-10-23T06:02:21.214Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-controlplane-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.214Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-controlplane-shoot"}
{"level":"info","ts":"2020-10-23T06:02:21.218Z","logger":"gardener-resource-manager.health-reconciler","msg":"Starting ManagedResource health checks","object":"shoot--foo--bar/extension-controlplane-storageclasses"}
{"level":"info","ts":"2020-10-23T06:02:21.218Z","logger":"gardener-resource-manager.health-reconciler","msg":"Skipping health checks for ManagedResource, as it is has not been reconciled successfully yet.","object":"shoot--foo--bar/extension-controlplane-storageclasses"}
E1023 06:02:22.090404       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.PodDisruptionBudget: Get https://kube-apiserver/apis/policy/v1beta1/poddisruptionbudgets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.097324       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Secret: Get https://kube-apiserver/api/v1/secrets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.098385       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.DaemonSet: Get https://kube-apiserver/apis/apps/v1/daemonsets?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.099407       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ServiceAccount: Get https://kube-apiserver/api/v1/serviceaccounts?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.100688       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Namespace: Get https://kube-apiserver/api/v1/namespaces?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.101760       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ConfigMap: Get https://kube-apiserver/api/v1/configmaps?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.103386       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ClusterRoleBinding: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.104123       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.ClusterRole: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.105368       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Service: Get https://kube-apiserver/api/v1/services?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.106388       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.HorizontalPodAutoscaler: Get https://kube-apiserver/apis/autoscaling/v1/horizontalpodautoscalers?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.107496       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1beta1.CustomResourceDefinition: Get https://kube-apiserver/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.108534       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Deployment: Get https://kube-apiserver/apis/apps/v1/deployments?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.109591       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.APIService: Get https://kube-apiserver/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.110817       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.RoleBinding: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.111773       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.StorageClass: Get https://kube-apiserver/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
E1023 06:02:22.112863       1 reflector.go:123] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: Failed to list *v1.Role: Get https://kube-apiserver/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: write tcp 10.243.161.48:47662->10.243.55.204:443: write: broken pipe
...

Failing health checks for `APIService`

How to categorize this issue?

/area usability
/kind bug
/priority normal

What happened:

{"level":"info","ts":"2020-06-22T14:14:44.780Z","logger":"gardener-resource-manager.health-reconciler","msg":"could not create new object of kind for health checks (probably not registered in the used scheme), falling back to unstructured request","object":"shoot--dev--foo/shoot-core","GroupVersionKind":"apiregistration.k8s.io/v1, Kind=APIService","error":"no kind \"APIService\" is registered for version \"apiregistration.k8s.io/v1\" in scheme \"github.com/gardener/gardener-resource-manager/cmd/gardener-resource-manager/app/app.go:124\""}

What you expected to happen:
No error message

How to reproduce it (as minimally and precisely as possible):
Let GRM manage an APIService object.
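
The message suggests the apiregistration.k8s.io/v1 types are simply missing from the scheme used by the health checks; a sketch of registering them (using the kube-aggregator package that provides these types):

package app

import (
	"k8s.io/apimachinery/pkg/runtime"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
)

// registerAPIServiceTypes adds the APIService types to the given scheme so the
// health checks no longer have to fall back to unstructured requests.
func registerAPIServiceTypes(scheme *runtime.Scheme) error {
	return apiregistrationv1.AddToScheme(scheme)
}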

Cannot update the container image of batch/v1 Job

How to categorize this issue?

/kind enhancement
/priority normal

What would you like to be added:

There was an issue with a Gardener Shoot, most likely caused by the GRM.
Apparently, some merging logic is applied when reconciling an existing resource with an updated resource.

In the logs below, you can see that the managed resource was unhealthy and that the batch/v1 Job had an invalid spec when applying it to the target cluster.

Network extension (shoot--dxp--trial/trial) reports failing health check: Health check summary: 1/1 unsuccessful, 0/1 progressing, 0/1 successful. Unsuccessful checks: 1) ManagedResourceUnhealthy: managed resource extension-networking-cilium-config in namespace shoot--dxp--trial is unhealthy: condition "ResourcesApplied" has invalid status False (expected True) due to ApplyFailed: Could not apply all new resources: 1 error occurred: error during apply of object "batch/v1/Job/kube-system/hubble-generate-certs": Job.batch "hubble-generate-certs" is invalid: spec.template: 

The issue got resolved after deleting the existing batch/v1 Job in the target cluster. This indicates that there is some problem with the merge of the existing and updated resource.
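
One way operators sometimes handle immutable fields like a Job's spec.template is to delete and recreate instead of patching. A hedged sketch of that idea (not GRM's actual strategy; a real implementation would also wait for the deletion to complete before recreating):

package jobutil

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateJobIfTemplateChanged deletes the existing Job with foreground
// propagation (so its pods go away as well) and creates the desired one when
// the pod template differs, because spec.template of a batch/v1 Job is
// immutable and cannot be patched.
func recreateJobIfTemplateChanged(ctx context.Context, c client.Client, desired, live *batchv1.Job) error {
	if equality.Semantic.DeepEqual(desired.Spec.Template, live.Spec.Template) {
		return nil
	}
	if err := c.Delete(ctx, live, client.PropagationPolicy(metav1.DeletePropagationForeground)); err != nil {
		return err
	}
	return c.Create(ctx, desired)
}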

Why is this needed:

ManagedResource cannot apply when resource is without explicit namespace

Since 0.6.0, a namespace for the resources of a ManagedResource is required. If no namespace is provided, g/gardener-resource-manager fails to apply the resources.

/kind bug

Steps to reproduce:

  1. Apply a ManagedResource with a resource that has no namespace:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example1
type: Opaque
stringData:
  objects.yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-1234
      annotations:
        resources.gardener.cloud/ignore: "true"
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-5678
EOF


$ cat <<EOF | kubectl apply -f -
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example1
EOF
  2. Ensure that the apply fails.
$ make start
{"level":"info","ts":"2019-11-07T16:46:00.297+0200","logger":"gardener-resource-manager.reconciler","msg":"Applying","resource":"v1/ConfigMap/default/test-1234"}
{"level":"error","ts":"2019-11-07T16:46:00.312+0200","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"default/example","error":"Errors occurred during applying: [error during apply of object \"v1/ConfigMap/default/test-1234\": the server does not allow this method on the requested resource error during apply of object \"v1/ConfigMap/default/test-5678\": the server does not allow this method on the requested resource]",

/cc @vpnachev

Switch to DynamicRESTMapper for target client

How to categorize this issue?

/area scalability
/kind cleanup
/priority normal

What would you like to be added:

With the recent improvements in controller-runtime (kubernetes-sigs/controller-runtime#1151), we can/should replace the DeferredDiscoveryRESTMapper with a DynamicRESTMapper to avoid explicit rediscovery.

https://github.com/gardener/gardener-resource-manager/blob/827ae66273760c834b2dcf34459d8cf2fca5a387/pkg/controller/managedresources/controller.go#L463-L468

This way, grm immediately rediscovers resources if some of them are not known (yet), so that no extra reconciliation is needed.
It will still adhere to the specified rate limits, so we don't make unnecessary discovery calls (even across managed resources, which wasn't the case before).
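
A sketch of constructing the target client with a dynamic mapper (signatures roughly as in controller-runtime at the time of kubernetes-sigs/controller-runtime#1151):

package target

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/apiutil"
)

// newTargetClient builds a client whose RESTMapper lazily (and rate-limited)
// rediscovers unknown GroupVersionKinds on demand, instead of relying on a
// DeferredDiscoveryRESTMapper that has to be reset explicitly.
func newTargetClient(cfg *rest.Config) (client.Client, error) {
	mapper, err := apiutil.NewDynamicRESTMapper(cfg)
	if err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{Mapper: mapper})
}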

managedresource stuck in deletion

Error on Gardener:

Flow "Shoot cluster deletion" encountered task errors: [task "Waiting until managed resources have been deleted" failed: retry failed with context deadline exceeded, last error: not all managed resources have been deleted in the shoot cluster (still existing: [addons])]

Message on addons managedresource:

Missing required Deployment with name "addons-nginx-ingress-nginx-ingress-k8s-backend".

Logs of resource manager pod:

{"level":"error","ts":"2019-11-25T13:13:04.240Z","logger":"gardener-resource-manager.reconciler","msg":"Deletion is still pending","object":"shoot--it--tmuog7z/addons","error":"Deletion of old resource v1/Service/kube-system/addons-nginx-ingress-controller is still pending","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).delete\n\t/go/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:235\ngithub.com/gardener/gardener-resource-manager/pkg/controller/managedresources.(*Reconciler).Reconcile\n\t/go/src/github.com/gardener/gardener-resource-manager/pkg/controller/managedresources/controller.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":"2019-11-25T13:13:04.240Z","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"shoot--it--tmuog7z/addons","error":"Deletion of old resource v1/Service/kube-system/addons-nginx-ingress-controller is still pending","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

There are no pods or deployments left on the shoot cluster:

kubectl get all --all-namespaces
NAMESPACE     NAME                                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
default       service/kubernetes                        ClusterIP      100.104.0.1     <none>          443/TCP                      42h
kube-system   service/addons-nginx-ingress-controller   LoadBalancer   100.108.88.55   172.18.120.41   443:30376/TCP,80:31097/TCP   42h

Infrastructure: OpenStack
Resource Manager image version: 0.7.0
Shoot Kubernetes version: 1.16.2

Shoot stuck in deletion due to stuck `ManagedResource`

How to categorize this issue?

/area usability
/kind bug
/priority normal

What happened:
Shoot was stuck in deletion with the following error:

Waiting until extension resources have been deleted" failed: 1 error occurred:
* Failed to delete Extension shoot--foo--bar/shoot-dns-service: Error deleting Extension resource: timed out waiting for the condition

Checking the logs of the gardener-resource-manager revealed:

{"level":"info","ts":"2020-10-21T13:09:03.264Z","logger":"gardener-resource-manager.reconciler","msg":"reconcile: action required: true, responsible: true","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"info","ts":"2020-10-21T13:09:03.264Z","logger":"gardener-resource-manager.reconciler","msg":"Starting to delete ManagedResource","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"info","ts":"2020-10-21T13:09:03.843Z","logger":"gardener-resource-manager.reconciler","msg":"All resources have been deleted, removing finalizers from ManagedResource","object":"shoot--foo--bar/extension-shoot-dns-service-seed"}
{"level":"error","ts":"2020-10-21T13:09:07.325Z","logger":"controller-runtime.controller","msg":"Reconciler error","controller":"resource-controller","request":"shoot--foo--bar/extension-shoot-dns-service-seed","error":"error removing finalizer from ManagedResource: Operation cannot be fulfilled on managedresources.resources.gardener.cloud \"extension-shoot-dns-service-seed\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/gardener/gardener-resource-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

After that, the GRM never tried to remove the finalizer again, leaving a stuck ManagedResource in the system.

What you expected to happen:
GRM retries to remove its finalizer
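
Conflicts like the one above can usually be retried with client-go's retry helper; a hedged sketch of the idea, assuming a controller-runtime version where client.Object and client.ObjectKeyFromObject have their current shape:

package finalizers

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizerWithRetry re-reads the object and retries the update on
// conflicts instead of giving up after a single failed attempt.
func removeFinalizerWithRetry(ctx context.Context, c client.Client, obj client.Object, finalizer string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
			return err
		}
		controllerutil.RemoveFinalizer(obj, finalizer)
		return c.Update(ctx, obj)
	})
}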

How to reproduce it (as minimally and precisely as possible):
n/a yet, sorry

Environment:

  • Gardener-Resource-Manager version: v0.17.0
  • Kubernetes version (use kubectl version): v1.17.8

Add option to disable target cache

How to categorize this issue?

/area quality robustness cost
/kind bug
/priority normal

What happened:

We observed a Shoot cluster which had a lot of very large secrets (~3000 with ~1MiB each).
As the gardener-resource-manager caches all objects it touches in the target cluster, its cache had grown to an unacceptable size:

$ k top po
NAME                                          CPU(cores)   MEMORY(bytes)
gardener-resource-manager-77876b69b7-kvvkz    400m         5875Mi

Also, one of its worker routines trying to get a secret for applying was stuck: the client tried to acquire a watch, but the API server panicked (for some reason), so the client waited forever for the watch cache to sync:

I1127 13:55:38.982551       1 trace.go:201] Trace[4834220]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224 (27-Nov-2020 13:54:00.969) (total time: 60013ms):
Trace[4834220]: [1m0.013079589s] [1m0.013079589s] END
E1127 13:55:38.982571       1 reflector.go:178] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:224: Failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 57251; INTERNAL_ERROR

What you expected to happen:

grm should be resilient to such user behaviour and should therefore allow disabling the target cache via a command-line flag.
This way, gardener can deploy grm instances for Shoots without a cache for the Shoot API server, avoiding failing watches and minimizing the memory footprint.
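
A sketch of how a flag-gated, uncached target client could look; the flag name is made up for illustration.

package target

import (
	"flag"

	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// hypothetical flag; the actual name and wiring are up to the implementation
var disableTargetCache = flag.Bool("disable-target-cache", false,
	"read objects in the target cluster directly from the API server instead of a cache")

// newTargetReader returns either the given cached client or a direct client
// that never opens watches against the target API server.
func newTargetReader(cfg *rest.Config, cached client.Client) (client.Client, error) {
	if !*disableTargetCache {
		return cached, nil
	}
	return client.New(cfg, client.Options{})
}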

/cc @mandelsoft @timuthy

Environment:

  • Gardener-Resource-Manager version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
